Exclusion of span with BS4 - Python

Posted by dgjrabp2 on 2022-12-25 in Python


So I am trying to exclude (not extract) the information contained in a span.

<li><span>Type:</span> Cardiac Ultrasound</li>

Here is my code:

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
        description_elements = description_el.find('span')
        for el in description_elements: 
            curr_el = {}
            key = el.replace(':', '')
            print(el)
            print(description_el.text.replace(' ', ''))

Here listing_soup is basically the whole page (the HTML above, in my example). When I run this, I get:

Type:
Type: CardiacUltrasound

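As you can see, for some peculiar reason :P, the span is not affected by my replace() call, even though .text produces a str.
Edit: Sorry, my goal is to create a bunch of dictionaries where the key is the span and the value is whatever follows the span.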
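The behaviour described above can be reproduced with plain strings, without BeautifulSoup. This is a minimal sketch, assuming the cause is that `str.replace()` returns a new string and that the loop printed the original `el` rather than `key`. The variable names mirror the question; `li_text` and `without_spaces` are illustrative names that do not appear in it.

```python
# Strings are immutable: replace() returns a new string and leaves
# the original object untouched.
el = "Type:"
key = el.replace(":", "")

print(el)   # the original string, colon included
print(key)  # the cleaned copy

# The second print in the question only strips spaces from the <li> text;
# it never removes the span's text from it.
li_text = "Type: Cardiac Ultrasound"
without_spaces = li_text.replace(" ", "")
print(without_spaces)
```

Printing `key` instead of `el` would show the cleaned label, but the `<li>` text would still contain the span's text until it is removed or skipped explicitly.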


Source: https://stackoverflow.com/questions/71439683/exclusion-of-span-with-bs4-python

2 Answers


nx7onnlm #1

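Note: be careful when creating a bunch of dictionaries, because a dictionary cannot have duplicate keys. You can, however, have a list of dictionaries, in which case duplicate keys across dictionaries do not matter (keys must still be unique within each dictionary).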
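To make the note about duplicate keys concrete, here is a small sketch. The sample pairs are made up for illustration (the second "Type" value is hypothetical and not from the question).

```python
# Hypothetical key/value pairs, as they might come out of two <li> tags.
pairs = [("Type", "Cardiac Ultrasound"), ("Type", "Vascular Ultrasound")]

# One shared dict: the second "Type" silently overwrites the first.
single = {}
for k, v in pairs:
    single[k] = v

# A list of one-entry dicts keeps both values.
many = [{k: v} for k, v in pairs]

print(single)
print(many)
```

Which shape fits depends on whether the same label can legitimately appear more than once on the page.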

**Option 1:**

Use .next_sibling (an attribute, not a method)

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    span = description_el.find('span')  # look the span up once
    k = span.text.replace(':', '')      # key: span text without the colon
    v = span.next_sibling.strip()       # value: the text node right after the span

    print(k)
    print(v)
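Option 1 assumes every `<li>` contains a `<span>` that is followed by a text node. A hedged variant that tolerates missing pieces and collects the results as a list of dictionaries could look like the sketch below. The extra `<li>` items in the sample markup and the `records` name are assumptions added for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the second <li> has no <span>,
# the third has a <span> with nothing after it.
html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li>
<li>No span here</li>
<li><span>Model:</span></li>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

records = []
for li in soup.find(class_='item_description').find_all('li'):
    span = li.find('span')
    if span is None:
        continue  # nothing to use as a key
    sibling = span.next_sibling
    # next_sibling is None when the span is the last node in the <li>,
    # and it could be a Tag rather than text, so check the type first.
    value = sibling.strip() if isinstance(sibling, NavigableString) else ''
    records.append({span.get_text().replace(':', '').strip(): value})

print(records)
```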

**Option 2:**

Just take the text of description_el and call .split(':') on it; that gives you the two elements you want (if I understood your question correctly).

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    
    print(k)
    print(v)
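The `split(':', 1)` call above already limits the split to the first colon, so values that contain colons survive. A closely related sketch uses `str.partition`, which also makes the "no colon at all" case explicit. The helper name `split_key_value` and the sample strings are hypothetical.

```python
def split_key_value(text):
    """Split 'Key: value' at the first colon; return None if there is no colon."""
    key, sep, value = text.partition(':')
    if not sep:
        return None  # no separator found, so there is no key/value pair
    return key.strip(), value.strip()


print(split_key_value('Type: Cardiac Ultrasound'))
print(split_key_value('Time: 10:30'))
print(split_key_value('No separator'))
```

This works on the combined text of the `<li>`, so like Option 2 it relies on the label never containing a colon of its own.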

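**Option 3:**

Get the <span> text, remove the span, then read the text that remains in the <li>. Since you said you do not want to extract, though, this may not be useful to you.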

from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos: 
    k = description_el.find('span').text.replace(':','')
    description_el.find('span').extract()
    v = description_el.text.strip()
    
    print(k)
    print(v)
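Note that `extract()` modifies the parse tree, so the span is gone from `listing_soup` afterwards. If the original tree needs to stay intact, one possible workaround is to run the same steps on a copy of the `<li>`. This is a sketch that assumes a Beautiful Soup version in which `copy.copy()` on a tag returns a detached copy including its children.

```python
import copy
from bs4 import BeautifulSoup

html = '<ul><li><span>Type:</span> Cardiac Ultrasound</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
li = soup.find('li')

li_copy = copy.copy(li)   # detached copy of the <li> and its children
span = li_copy.find('span')
key = span.get_text().replace(':', '').strip()
span.extract()            # removes the span from the copy only
value = li_copy.get_text().strip()

print(key)
print(value)
print(li.find('span') is not None)  # the original tree still has its span
```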

**Output:**

Type
Cardiac Ultrasound


j7dteeu8 #2

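To extract a tag's text without the content of its child tags, you can use the method from this answer. In general, just iterate over the <li> tags and take the text from those that contain a child <span> tag.
Code: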

from bs4 import BeautifulSoup, NavigableString

html = """<html><body>
<li><span>Key1:</span> Value1</li>
<li><span>Key2:</span> Value2</li>
<li>NoKeyValue</li>
<li><span>Key3:</span> Value3</li>
<li><span>Key4:</span> Value4</li>
</body></html>"""

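result = {}
for li in BeautifulSoup(html, "html.parser").find_all("li"):
    span = li.find("span")
    if span:  # skip <li> tags that have no <span> to use as a key
        # Key: span text without the colon; value: only the text nodes
        # that are direct children of the <li> (the span is a Tag, so it is skipped).
        result[span.text.strip(" :")] = \
            "".join(e for e in li if isinstance(e, NavigableString)).strip()

print(result)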
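An equivalent way to collect only the direct text children, as far as I know, is `find_all(string=True, recursive=False)`. The sketch below assumes a Beautiful Soup version that accepts the `string` argument (older releases call it `text`); the `<b>` tag in the sample markup is hypothetical and only there to show that nested text is skipped.

```python
from bs4 import BeautifulSoup

html = '<ul><li><span>Key1:</span> Value1<b>nested</b></li></ul>'
li = BeautifulSoup(html, 'html.parser').find('li')

# string=True matches text nodes; recursive=False restricts the search
# to direct children of the <li>, so text inside <span> and <b> is ignored.
direct_text = ''.join(li.find_all(string=True, recursive=False)).strip()

print(direct_text)
print(li.get_text())  # for comparison: includes the span and nested text
```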
