How to remove all tags from text in Python

Posted by pxq42qpu on 2022-11-27 in Python

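I want to extract data from a tag and retrieve only its text. Unfortunately, I can't get just the text: the links inside the tag always come along with it.
Is it possible to remove all <img> and <a href> tags from the text?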

<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>

I only want to get back this: its a good day, ignoring the content of the <a href> tag inside the <div> tag.
Currently I do the extraction with beautifulsoup.find('div').


balp4ylt #1

Try this:

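from bs4 import BeautifulSoup

# To parse a live page instead, fetch it first, e.g.:
# import requests
# response = requests.get('your url')

soup = BeautifulSoup('''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a> 
</div>''', 'html.parser')

divs = soup.find_all(class_='xxx')

# The wanted text is the first line of the div's text; this relies on the
# newline that separates it from the link.
print(divs[0].text.split('\n')[0])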
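
Splitting on the newline only works while the text and the link sit on separate lines. As a hedged alternative sketch, the direct text nodes of the div can be collected instead, which does not depend on line breaks. It assumes a Beautiful Soup version that accepts the `string` argument (4.4.0 or later); the variable names are illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup

html = '''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='xxx')

# string=True matches text nodes only; recursive=False restricts the search
# to direct children of the div, so the text nested inside <a> is skipped.
direct_text = ''.join(div.find_all(string=True, recursive=False)).strip()

print(direct_text)
```

Unlike the newline split, this also picks up text that appears after the link, as long as it is a direct child of the div.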

rqmkfv5c #2

Let's import re and use re.sub:

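import re

s1 = '<div class="xxx" data-handler="xxx">its a good day'
s2 = '<a class="link" href="https://" title="text">https:// link</a></div>'

# Greedy match: removes everything from the first '<' to the last '>' in
# each string, so in s2 the link text between the tags is dropped as well.
s1 = re.sub(r'<[^()]*>', '', s1)
s2 = re.sub(r'<[^()]*>', '', s2)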

Output

>>> print(s1)
its a good day
>>> print(s2)

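
The pattern above depends on the input already being split into two strings. A minimal sketch for the whole fragment in one string might first drop each `<a>...</a>` element together with its content and then strip the remaining tags. The two patterns are illustrative assumptions, not from the original answer, and regular expressions remain fragile on real-world HTML (nested links, `>` inside attribute values), so a parser is usually the safer choice.

```python
import re

html = '''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>'''

# Step 1: remove every <a ...>...</a> element, including the link text.
# The non-greedy .*? stops at the first closing </a>.
no_links = re.sub(r'<a\b[^>]*>.*?</a>', '', html, flags=re.DOTALL | re.IGNORECASE)

# Step 2: remove whatever tags are left, keeping the text between them.
text = re.sub(r'<[^>]+>', '', no_links).strip()

print(text)
```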

u5rb5r59 #3

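Edit

Based on your comment, all text before the <a> should be captured, not just the first text node in the element. Select all previous_siblings of the <a> and keep only the NavigableString items: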

# previous_siblings yields the nearest sibling first; reverse for document order
' '.join(
    s for s in reversed(list(soup.select_one('.xxx a').previous_siblings))
    if isinstance(s, NavigableString)
)

Example

from bs4 import NavigableString, BeautifulSoup

html='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# previous_siblings yields the nearest sibling first; reverse for document order
print(' '.join(
    s for s in reversed(list(soup.select_one('.xxx a').previous_siblings))
    if isinstance(s, NavigableString)
))

To get only the element's own text and not its child tags, you can use:

.find(text=True)

If the pattern is always the same and the text is the first piece of content inside the element:

.contents[0]

Example

from bs4 import BeautifulSoup
html='''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find(text=True) returns the first text node inside the div
print(soup.div.find(text=True).strip())

Output

its a good day
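
The `.contents[0]` variant mentioned above can be sketched the same way. This is a minimal illustration on the same HTML; the `isinstance` guard is an added assumption to cover the case where the first child turns out to be a tag rather than text.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# contents is the list of direct children; index 0 is the leading text node
# only when the element really starts with text.
first_child = soup.div.contents[0]
first_text = first_child.strip() if isinstance(first_child, NavigableString) else ''

print(first_text)
```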

wsxa1bj1 #4

So basically, you don't want any of the text inside the <a> tags, and you don't want any of the tags themselves either.

from bs4 import BeautifulSoup

html1='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a></div>
'''
html2 = ''' <div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div> '''

html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''

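soup = BeautifulSoup(html3,'html.parser')

# Remove every <a href> element, including its text, from the tree
for t in soup.find_all('a', href=True):
    t.decompose()
# Only the text outside the links remains
test = soup.find('div',class_='xxx').get_text().strip()

print(test)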

Output:

#for html1: New wallpaper Find over 100+ of
#for html2: its a good day
#for html3: New wallpaper Find over 100+ of its a good day
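
Since the question also mentions <img>, the same decompose idea could be wrapped in a small helper. This is only a sketch: the function name `text_without` and its `drop` parameter are hypothetical, and it returns the text of the whole parsed fragment rather than of one particular div.

```python
from bs4 import BeautifulSoup

def text_without(html, drop=('a', 'img')):
    """Return the text of `html` with the `drop` elements removed entirely."""
    soup = BeautifulSoup(html, 'html.parser')
    names = list(drop)
    # Remove one matching element at a time and search again, so an element
    # nested inside an already removed one is never touched.
    tag = soup.find(names)
    while tag is not None:
        tag.decompose()
        tag = soup.find(names)
    # Join the remaining text nodes with single spaces, trimming each one.
    return soup.get_text(' ', strip=True)

html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''

print(text_without(html3))
```

Passing a separator to get_text keeps text from adjacent elements from running together, which matters once the tags between them are gone.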
