HTML: How to extract the data in between 2 header tags?

Posted by cwxwcias on 2023-02-02 in Other

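I'm scraping a website and want to extract the data between two header tags, using the first header as the key of a key-value pair.
How can I extract the text that sits under each header (such as h1 and h2)?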

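import re
from bs4 import BeautifulSoup

# page is assumed to be a requests.Response for the target URL
soup = BeautifulSoup(page.content, 'html.parser')
items = soup.select("div.conWrap")

# collect every h1-h6 tag as a {tag name: tag text} pair
htag_count = []
item_header = soup.find_all(re.compile('^h[1-6]'))
for item in item_header:
    htag_count.append({item.name: item.text})

print(htag_count)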

Source: https://stackoverflow.com/questions/75256667/how-to-extract-the-data-in-between-2-header-tags

1 Answer

oogrdqng1#

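This won't work if the header tags (h1 to h6) don't share a direct parent, but you can try looping through the siblings that follow each header tag [and stopping when the next header tag is reached].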

# url = 'https://en.wikipedia.org/wiki/Chris_Yonge' [ for example ]
# soup = BeautifulSoup(requests.get(url).content)

# item_header = soup.find_all(re.compile('^h[1-6]')) # should be same as
item_header = soup.find_all([f'h{i}' for i in range(1,7)])

skipTags = ['script', 'style'] # any tags you don't want text from
hSections = []

for h in item_header:
    sectionLines = []

    for ns in h.find_next_siblings():
        if ns in item_header: break # stop if/when next header is reached
        if ns.name in skipTags: continue # skip certain tags

        sectionLines.append(' '.join(ns.get_text(' ').split())) 
        # [ split+join to minimize whitespace ] 

    hSections.append({
        'header_type': h.name, 'header_text': h.get_text(' ').strip(),
        'section_text': '\n'.join([l for l in sectionLines if l])
    })
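
The caveat about headers that do not share a direct parent can be sidestepped by walking the document in source order instead of following siblings. The following is only a minimal sketch of that idea, and it swaps in the standard library's html.parser.HTMLParser in place of BeautifulSoup; the class name HeaderSectionParser, the sample_html string, and the by_header dict are assumptions made for illustration and are not part of the answer's code.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # same role as skipTags in the answer


class HeaderSectionParser(HTMLParser):
    """Collect text between headers in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sections = []     # one dict per header, in document order
        self._in_header = False
        self._skip_depth = 0   # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            # A new header closes the previous section and opens a new one.
            self.sections.append(
                {"header_type": tag, "header_text": "", "section_text": ""}
            )
            self._in_header = True
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self._in_header = False
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # split+join to minimize whitespace
        if not text or self._skip_depth or not self.sections:
            return  # whitespace, skipped tag, or text before the first header
        key = "header_text" if self._in_header else "section_text"
        current = self.sections[-1]
        current[key] = f"{current[key]} {text}".strip()


# Assumed sample: the two headers live in different parent elements.
sample_html = """
<div><h1>Title</h1></div>
<div><p>Intro <b>text</b></p><style>p {color: red}</style></div>
<section><h2>Details</h2><p>First.</p><p>Second.</p></section>
"""

parser = HeaderSectionParser()
parser.feed(sample_html)
parser.close()
sections = parser.sections

# Key-value view asked for in the question: header text -> text under it.
by_header = {s["header_text"]: s["section_text"] for s in sections}
print(by_header)
```

Because by_header is keyed by header text, two headers with identical text would overwrite each other; keeping the sections list avoids that at the cost of a less direct lookup.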

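I couldn't test the sibling loop properly because you didn't include an HTML snippet or a link to the site you want to scrape, but when I tried it on the Wikipedia page, hSections (after truncating and tabulating) looked like this: [output table not preserved in this copy]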

If you're interested in nesting subsections under their parent sections, you can also look at this solution.
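
As a rough illustration of what nesting subsections under parent sections could look like, the sketch below takes a flat list shaped like hSections and builds a tree with a stack of currently open sections. It is not the linked solution; the function name nest_sections, the children key, and the sample records are all assumptions made for this example.

```python
def nest_sections(flat_sections):
    """Nest flat header records by level (hypothetical helper)."""
    # Artificial root at level 0 so that every real header has a parent.
    root = {"header_type": "h0", "header_text": None,
            "section_text": "", "children": []}
    stack = [root]  # stack[-1] is the most recently opened section

    for record in flat_sections:
        node = dict(record, children=[])
        level = int(node["header_type"][1])  # 'h2' -> 2
        # Close every open section at the same or a deeper level.
        while int(stack[-1]["header_type"][1]) >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)

    return root["children"]


# Assumed sample records, in the same shape as the entries of hSections.
flat = [
    {"header_type": "h1", "header_text": "Title", "section_text": "Intro"},
    {"header_type": "h2", "header_text": "Early life", "section_text": "A"},
    {"header_type": "h3", "header_text": "School", "section_text": "B"},
    {"header_type": "h2", "header_text": "Career", "section_text": "C"},
]
tree = nest_sections(flat)
print([child["header_text"] for child in tree[0]["children"]])
```

A header is attached to the nearest preceding header with a smaller level number, so a page that jumps from h1 straight to h3 still nests the h3 under the h1.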

2023-02-02