How to remove all tags from text in Python

Posted by pxq42qpu on 2022-11-27 in Python

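I want to extract data from a tag so that I retrieve only its text. Unfortunately, I cannot get just the text: I always end up with the links inside the tag as well.
Is it possible to remove all <img> and <a href> tags from the text?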

<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>

I only want to get back this: its a good day, and ignore the content of the <a href> tag inside the <div> tag.
Currently I do the extraction with beautifulsoup.find('div')

python

Source: https://stackoverflow.com/questions/74589261/how-to-remove-all-balise-in-text-python

4 Answers

balp4ylt1#

Try this:

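import requests
from bs4 import BeautifulSoup

# In real use, fetch the page first:
# response = requests.get('your url')

soup = BeautifulSoup('''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>''', 'html.parser')

# find_all returns a list of every element with class "xxx"
divs = soup.find_all(class_='xxx')

# .text also contains the link text, so keep only the first line
print(divs[0].text.split('\n')[0])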
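
Splitting on the first newline only works while the wanted text sits on its own line. As a rough alternative sketch (not part of the original answer), the direct text nodes of the <div> can be collected with find_all(string=True, recursive=False), which skips text nested inside child tags such as <a>. The variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='xxx')

# string=True matches only text nodes; recursive=False limits the search
# to direct children of the <div>, so the text inside <a> is not visited.
direct_text = div.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join what is left.
own_text = ' '.join(s.strip() for s in direct_text if s.strip())
print(own_text)
```

This keeps every piece of text that belongs directly to the <div>, regardless of where the line breaks fall.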

Posted 2022-11-27

rqmkfv5c2#

Let's import re and use re.sub:

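import re

s1 = '<div class="xxx" data-handler="xxx">its a good day'
s2 = '<a class="link" href="https://" title="text">https:// link</a></div>'

# The pattern is greedy: it removes everything from the first '<' to the
# last '>' in the string (as long as no parentheses occur in between),
# so in s2 the link text between the tags is removed as well.
s1 = re.sub(r'<[^()]*>', '', s1)
s2 = re.sub(r'<[^()]*>', '', s2)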

Output

>>> s1
'its a good day'
>>> s2
''
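
A regular expression like this is sensitive to how the markup is split across strings. For comparison, here is a minimal sketch that uses only the standard library's html.parser.HTMLParser to collect text while ignoring anything nested inside <a>. The class name TextOutsideTags and the function strip_tags are hypothetical names chosen for this illustration, and the approach is a swapped-in technique, not the one from the answer above.

```python
from html.parser import HTMLParser


class TextOutsideTags(HTMLParser):
    """Collect text, ignoring anything nested inside the skipped tags."""

    def __init__(self, skip_tags=('a',)):
        super().__init__()
        # Only list tags that have a closing tag here. Void elements such
        # as <img> carry no text, so they need no special handling.
        self._skip_tags = set(skip_tags)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skip_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)


def strip_tags(html, skip_tags=('a',)):
    parser = TextOutsideTags(skip_tags)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(''.join(parser._parts).split())


html = '''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>'''
print(strip_tags(html))
```

Because the parser tracks nesting depth, the result does not depend on whether the opening and closing tags happen to sit in the same string.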

Posted 2022-11-27

u5rb5r593#

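EDIT

Based on your comment, all the text before the <a> should be captured, not just the first text node in the element. Select all previous_siblings of the <a> and keep the ones that are NavigableString. Note that previous_siblings yields the nearest sibling first, so the result is reversed to restore document order:

' '.join(
    reversed([s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)])
)

Example

from bs4 import NavigableString, BeautifulSoup

html='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

print(' '.join(
    reversed([s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)])
))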

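To focus only on the text and not on the element's child tags, you could use the following (recent Beautiful Soup versions name this argument string; text is the older spelling):

.find(text=True)

If the pattern is always the same and the text is the first part of the element's content:

.contents[0]

Example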

from bs4 import BeautifulSoup
html='''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# first text node inside the <div>, with surrounding whitespace removed
print(soup.div.find(text=True).strip())

Output

its a good day
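
The two shortcuts above behave differently when the element starts with a tag rather than with text. The following is a small sketch on a made-up snippet (the HTML here is an assumption for illustration): .contents[0] returns whatever the first child is, while find(string=True) returns the first text node in document order.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup in which the <div> begins with a <br>, not with text.
html = '<div class="xxx"><br>New wallpaper <a href="https://">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

first_child = div.contents[0]        # the <br> tag, not a text node
first_text = div.find(string=True)   # first text node in document order

print(isinstance(first_child, NavigableString))
print(first_text.strip())
```

So .contents[0] is only safe when the text is guaranteed to come first, as the answer already notes.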

Posted 2022-11-27

wsxa1bj14#

So basically, you don't want any text that sits inside <a> tags, nor any of the tags themselves.

from bs4 import BeautifulSoup

html1='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a></div>
'''
html2 = ''' <div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div> '''

html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''

soup = BeautifulSoup(html3,'html.parser')

for t in soup.find_all('a', href=True):
    t.decompose()
test = soup.find('div',class_='xxx').getText().strip()

print(test)

Output:

#for html1: New wallpaper Find over 100+ of
#for html2: its a good day
#for html3: New wallpaper Find over 100+ of its a good day
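
Since the question also mentions <img>, the same decompose idea can be wrapped in a small helper that drops several tag names at once. This is only a sketch under assumptions: the function name text_without, its parameters, and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def text_without(html, selector, drop=('a', 'img')):
    """Return the text of the first element matching `selector`,
    with every tag named in `drop` removed beforehand."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one(selector)
    if node is None:
        return ''
    names = list(drop)
    # Look up one unwanted tag at a time, so that a tag nested inside
    # an already removed tag is never touched again.
    tag = node.find(names)
    while tag is not None:
        tag.decompose()
        tag = node.find(names)
    # Collapse whitespace left behind by the removed tags.
    return ' '.join(node.get_text().split())


html = ('<div class="xxx"><br>New wallpaper <br>Find over 100+ of '
        '<a class="link" href="https://">https:// link </a>'
        '<img src="x.png"></div>')
print(text_without(html, '.xxx'))
```

Passing a different tuple as drop, for example ('a', 'img', 'span'), would remove those tags as well.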

Posted 2022-11-27
