python-3.x: How to split at uppercase and brackets

Posted by ffx8fchx on 2022-12-01 in Python


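I am trying to parse a lyrics website and need to collect the lyrics of a song. I have a problem with my output.
I need the lyrics to be displayed as shown in the screenshot [image].
I have already figured out how to split the text at uppercase letters, but one thing remains: the brackets are not split correctly. Here is my code: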

import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://genius.com/Taylor-swift-lavender-haze-lyrics')
#print(r.status_code)
if r.status_code != 200:
    print('Error')
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.find_all('title')
titles = titles[0].text
titlist = titles.split('Lyrics | ')
titlist.pop(1)
titlist = titlist[0].replace("\xa0", " ")
print(titlist)
divs = soup.find_all('div', {'class' : 'Lyrics__Container-sc-1ynbvzw-6 YYrds'})
#print(divs[0].text)
lyrics = (divs[0].text)
res = re.findall(r'[A-Z][^A-Z]*', lyrics)
res_l = []
for el in res:
    res_l.append(el + '\n')
    print(el)

The output is shown in the screenshot [image]. How can I fix this?
For those who asked, I have added the full code.

python-3.x

Source: https://stackoverflow.com/questions/74603507/how-to-split-at-uppercase-and-brackets

2 Answers


wnavrhmk1#

You can .unwrap() the unnecessary tags (<a>, <span>), replace each <br> with a newline, and then get the text:

import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Taylor-swift-lavender-haze-lyrics"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for t in soup.select("#lyrics-root [data-lyrics-container]"):

    for tag in t.select("span, a"):
        tag.unwrap()

    for br in t.select("br"):
        br.replace_with("\n")

    print(t.text)

Prints:

[Intro]
Meet me at midnight

[Verse 1]
Staring at the ceiling with you
Oh, you don't ever say too much
And you don't really read into
My melancholia

[Pre-Chorus]
I been under scrutiny (Yeah, oh, yeah)
You handle it beautifully (Yeah, oh, yeah)
All this shit is new to me (Yeah, oh, yeah)

...and so on.


kulphzqa2#

Because brackets have a special meaning in regular expressions, you need to escape them. In Python, you should be able to use \[ to get what you want.
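
As a rough illustration of the escaping idea, here is a minimal sketch that extends the question's pattern so a whole bracketed header such as [Verse 1] is kept as one piece. The sample string, the name `pattern`, and the helper `split_lyrics` are made up for this example and are not taken from the page. Splitting on capital letters remains a heuristic: a capital in the middle of a line (for example "I") still starts a new piece.

```python
import re

# Hypothetical sample: lyrics text with the line breaks lost, as happens
# when .text is read from a container whose lines are separated by <br>.
sample = "[Intro]Meet me at midnight[Verse 1]Staring at the ceiling with you"

# Either a whole bracketed header such as "[Verse 1]" (note the escaped
# brackets), or a run that starts at an uppercase letter and stops before
# the next uppercase letter or opening bracket.
pattern = re.compile(r"\[[^\]]*\]|[A-Z][^A-Z\[]*")


def split_lyrics(text):
    """Return bracketed headers and capital-led runs as separate pieces."""
    return pattern.findall(text)


pieces = split_lyrics(sample)
print("\n".join(pieces))
```

The bracket alternative is listed first so that a header is consumed as a unit before the uppercase rule can split it at the capital letter inside.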
