python-3.x: How to split on uppercase letters and brackets

ffx8fchx  posted on 2022-12-01 in Python

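I'm trying to parse a lyrics website and need to collect the lyrics of a song. My output is wrong.
I need the lyrics to be displayed line by line, as they appear on the site (screenshot not included).
I already know how to split the text on uppercase letters, but one problem remains: the bracketed parts are not split correctly. Here is my code: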

import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://genius.com/Taylor-swift-lavender-haze-lyrics')
#print(r.status_code)
if r.status_code != 200:
    # stop here instead of going on to parse an error page
    raise SystemExit('Error: HTTP %d' % r.status_code)
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.find_all('title')
titles = titles[0].text
titlist = titles.split('Lyrics | ')
titlist.pop(1)
titlist = titlist[0].replace("\xa0", " ")
print(titlist)
divs = soup.find_all('div', {'class' : 'Lyrics__Container-sc-1ynbvzw-6 YYrds'})
#print(divs[0].text)
lyrics = (divs[0].text)
res = re.findall(r'[A-Z][^A-Z]*', lyrics)
res_l = []
for el in res:
    res_l.append(el + '\n')
    print(el)

The output I get is shown in the screenshot (not included). How can I fix this?
For those who asked, I have added the full code.

Answer #1 (wnavrhmk)

You can .unwrap() the unnecessary tags (<a>, <span>), replace each <br> with a newline, and then get the text:

import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Taylor-swift-lavender-haze-lyrics"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for t in soup.select("#lyrics-root [data-lyrics-container]"):

    for tag in t.select("span, a"):
        tag.unwrap()

    for br in t.select("br"):
        br.replace_with("\n")

    print(t.text)

Prints:

[Intro]
Meet me at midnight

[Verse 1]
Staring at the ceiling with you
Oh, you don't ever say too much
And you don't really read into
My melancholia

[Pre-Chorus]
I been under scrutiny (Yeah, oh, yeah)
You handle it beautifully (Yeah, oh, yeah)
All this shit is new to me (Yeah, oh, yeah)

...and so on.

Answer #2 (kulphzqa)

Brackets have a special meaning in regular expressions, so you need to escape them. In Python, you should be able to use \[ to get what you want.
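As a rough illustration of the escaping idea, here is a minimal sketch that works on the already-concatenated text rather than on the HTML. The function name `split_lyrics` and the sample string are hypothetical and not taken from the question. It assumes that glued lines meet with no whitespace between a lowercase letter (or closing punctuation) and the next capital letter, which is only a heuristic: names such as "McCartney" would be split in the wrong place.

```python
import re


def split_lyrics(text):
    """Heuristically re-insert line breaks into concatenated lyrics text."""
    # \[ and \] match literal brackets. Each [Section] header is moved
    # onto its own line, with a blank line in front of it.
    text = re.sub(r"\s*(\[[^\]]*\])\s*", r"\n\n\1\n", text)
    # Insert a newline wherever a lowercase letter or closing punctuation
    # is immediately followed by an uppercase letter (no space between).
    # An opening "(" is not in the lookbehind set, so "(Yeah" stays intact.
    text = re.sub(r"(?<=[a-z)?!.,'])(?=[A-Z])", "\n", text)
    return text.strip()


if __name__ == "__main__":
    # Hypothetical sample imitating the glued text from the question.
    sample = (
        "[Intro]Meet me at midnight"
        "[Verse 1]Staring at the ceiling with you"
        "Oh, you don't ever say too much"
    )
    print(split_lyrics(sample))
```

Because this relies on capitalisation alone, extracting the text from the HTML with the <br> tags converted to newlines is likely to be more reliable when the page markup is available.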
