Python: how to extract details from specific HTML text

fiei3ece  posted on 2023-02-11  in  Python

from bs4 import BeautifulSoup

html_content = """<div id="formContents" class="dformDisplay ">
<div class="sectionDiv expanded">
<table id="sect_s1" class="formSection LabelsAbove">
<tr class="formRow ">
<td id="tdl_8" class="label lc" >
<label class="fieldLabel " ><b >Address</b></label>
 <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
 <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
    <div id="tdf_8" class="cell cc" >
    <a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div>               
</td></tr></table>
</td>
"""
try:
    soup = BeautifulSoup(html_content, 'html.parser')
    form_data = soup.find("div",{"id":"formContents"})
    if form_data:
        section_data = soup.findAll("div",{"class":"sectionDiv expanded"})
        for datas in section_data:
            labels = datas.findAll("label",{"class":"fieldLabel"})
            for item in labels:
                labels = item.text
                print(labels)
                entity_data = item.findAll("td").text
                print(entity_data)

except Exception as e:
    print(e)

My desired output:

Address : 183 Pelham Wood Dr, Rock Hill, SC 29732.

Is there any solution to get this specific output using BeautifulSoup? I need the address from this particular HTML source content.


euoag5mw #1

  • In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors - for details, take a minute to check the docs

You can select all <td> elements that contain a <label> and then extract the contents with stripped_strings - if the motivation is the same as in How to scrape data from the website which is not aligned properly, you can get a well-structured dict of labels and texts:

dict(e.stripped_strings for e in soup.select('#formContents td:has(label)'))
Example
from bs4 import BeautifulSoup
html_content = """<div id="formContents" class="dformDisplay ">        
<div class="sectionDiv expanded">
<div class="Title expanded ToggleSection shead"
  style="margin-top:1em"
 id="sect_s11Header">
<div><!--The div around the table is so that the toggling can be animated smoothly-->
<table id="sect_s1" class="formSection LabelsAbove">
<tr class="formRow ">
<td id="tdl_8" class="label lc" >
<label class="fieldLabel " ><b >Address</b></label>
 <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
 <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
    <div id="tdf_8" class="cell cc" >
    <a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div>               
</td></tr></table>
</td>
"""

soup = BeautifulSoup(html_content, 'html.parser')

dict(e.stripped_strings for e in soup.select('#formContents td:has(label)'))
Output
{'Address': '1183\nPelham Wood Dr, Rock Hill, SC 29732'}
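The value above still carries the newline from the source HTML; a minimal follow-up sketch (assuming the dict returned by the select line above) to collapse that whitespace:

```python
# Collapse internal whitespace (including the stray '\n') in each value of
# the dict produced above; the input here mirrors the example output.
data = {'Address': '1183\nPelham Wood Dr, Rock Hill, SC 29732'}
clean = {k: ' '.join(v.split()) for k, v in data.items()}
print(clean['Address'])  # 1183 Pelham Wood Dr, Rock Hill, SC 29732
```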

icnyk63a #2

You can search for an a tag whose href starts with https://maps.google.com:

>>> import re
>>> soup.find('a', {'href': re.compile('^https://maps.google.com')}).text.replace('\n', ' ')
'1183 Pelham Wood Dr, Rock Hill, SC 29732'

What matters here is not the particular soup object but the strategy: using a regexp to extract the address text from the tag.
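The key detail is the ^-anchored pattern; a quick standard-library check of that regexp (the hrefs here are made-up stand-ins):

```python
import re

# '^' anchors the match to the start of the href string, so a Google Maps
# URL buried later in the string does not count as a match.
pattern = re.compile('^https://maps.google.com')
print(bool(pattern.search('https://maps.google.com/?q=1183+Pelham+Wood+Dr')))  # True
print(bool(pattern.search('https://example.com/?u=https://maps.google.com')))  # False
```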


d7v8vwbk #3

When I try your code, it prints

Address
ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Note the second line of that output: item.findAll("td").text will always raise that error; you could instead do something like '\n'.join([td.text for td in item.findAll("td")]), which doesn't raise anything.
However, it would only return an empty string [since item.findAll("td") is an empty ResultSet], because with for item in labels ... item.findAll("td") you are searching for td tags *inside* the label tags, when they are actually in the table tag *next to* each label.

Solution 1: use .find_next_siblings
soup = BeautifulSoup(html_content, 'html.parser')
form_data = soup.find("div",{"id":"formContents"})
if form_data:
    section_data = soup.find_all("div",{"class":"sectionDiv expanded"})
    for datas in section_data:
        labels = datas.find_all("label",{"class":"fieldLabel"})
        for item in labels: 
            print(item.text) ## label
            for nxtTable in item.find_next_siblings('table'):
                print('\n'.join([td.text for td in nxtTable.find_all("td")]))
                break ## [ only takes the first table ]

[Like this, you also don't need the try...except.]
[For me] that prints

Address

1183
Pelham Wood Dr, Rock Hill, SC 29732
Solution 2: use .select with CSS selectors
soup = BeautifulSoup(html_content, 'html.parser')

section_sel = 'div#formContents div.sectionDiv.expanded'
label_sel = 'label.fieldLabel'
for datas in soup.select(f'{section_sel}:has({label_sel}+table td)'):
    labels = datas.select(f'{label_sel}:has(+table td)')
    labels = [' '.join(l.get_text(' ').split()) for l in labels]
    entity_data = [' '.join([
        ' '.join(td.get_text(' ').split()) for td in ed.select('td')
    ]) for ed in datas.select(f'{label_sel}+table:has(td)')]
    # data_dict = dict(zip(labels, entity_data))
    for l, ed in zip(labels, entity_data): print(f'{l}: {ed}')

This should print

Address: 1183 Pelham Wood Dr, Rock Hill, SC 29732

Btw, dict(zip(labels, entity_data)) would return {'Address': '1183 Pelham Wood Dr, Rock Hill, SC 29732'}. I used ' '.join(td.get_text(' ').split()) rather than just td.text (and likewise for the l in labels) to minimize whitespace and get everything on one line.
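The dict(zip(...)) step in isolation, with stand-in lists matching the example output:

```python
# zip pairs each label with its extracted text; dict turns the pairs into a mapping
labels = ['Address']
entity_data = ['1183 Pelham Wood Dr, Rock Hill, SC 29732']
data_dict = dict(zip(labels, entity_data))
print(data_dict)  # {'Address': '1183 Pelham Wood Dr, Rock Hill, SC 29732'}
```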

    • Note: neither solution is very reliable unless every label goes with exactly one table; the second solution assumes the table is directly adjacent to the label (and will skip any label that has no adjacent table containing td tags), while the first risks picking up the next label's table if a label's own table is missing.
