python-3.x: How to extract specific data from HTML

ykejflvf, posted on 2023-02-14 in Python
from bs4 import BeautifulSoup

html_content = """<div class="sectionDiv expanded">
<div class="Title expanded ToggleSection shead"
  style="margin-top:1em"
 id="sect_s1Header">
<span class="sectionTitle">Issue details:</span></div>
<hr />
<div><!--The div around the table is so that the toggling can be animated smoothly-->
<table id="sect_s1" class="formSection LabelsAbove">
<tr class="formRow ">
<td id="tdl_8" class="label lc" >
<label class="fieldLabel " ><b >Address</b></label>
 <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
 <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
    <div id="tdf_8" class="cell cc" >
    <a
href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div>
</td></tr></table>
</td>
<td id="tdl_9" class="label lc" colspan=100>
<label class="fieldLabel " ><b >Dispatch Region</b></label>

 <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
 <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
    <div id="tdf_9" class="cell cc nowrap" >5</span></div>
</td></tr></table>
</td>
</tr>
<div><!--The div around the table is so that the toggling can be animated smoothly-->
<table id="sect_s19" class="formSection LabelsLeft">

<tr class="formRow ">
<td id="tdl_52" class="label lc"
     
     style="vertical-align:top; border:0 solid white; border-bottom-width: .83em; padding: 0.83em 0; border-right-width: .83em;"
     
><label class="fieldLabel" ><b>Preference 1: </b></label></td>
<td id="tdf_52"  class="cell cc nowrap"
     
     style="background-color:#f5f5f5; border: solid white; border-top-width:0;border-right-width: 5.455em; border-left-width:.909em; vertical-align:top; border-bottom-width: .909em; padding: .83em;"
     
 >Friday, 02-03</td>"""

try:
    soup = BeautifulSoup(html_content, 'html.parser')
    section_data = soup.find_all("div",{"class":"sectionDiv expanded"})
    for datas in section_data:
        labels = datas.find_all("label",{"class":"fieldLabel"})
        for item in labels:
            label = item.text
            print(label)

except Exception as e:
    print(e)

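Required output:
Address: 1183 Pelham Wood Dr, Rock Hill, SC 29732; Dispatch Region: 5 Preference 1: Friday, 02-03
Is there a way to get this specific output? I need to loop over each "sectionDiv expanded" div and extract the details. I can get all the labels, but I cannot get the values that belong to them.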

Answer 1 (xtfmy6hx)

You can use a dict to hold the labels and their values, then pick a value by its label:

dict(e.stripped_strings for e in soup.select('td:has(label)'))
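
A minimal sketch of why this works, on a simplified, hypothetical fragment rather than the HTML from the question: `stripped_strings` yields each non-empty text node with surrounding whitespace removed, so a cell that holds exactly one label and one value produces a two-item sequence that `dict()` accepts as a key/value pair. A cell that yields any other number of strings raises `ValueError`, so this approach assumes every matched cell has that shape.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: each <td> holds one label and one value.
sample = """
<table><tr>
  <td><label><b>Address</b></label><div>1183 Pelham Wood Dr</div></td>
  <td><label><b>Dispatch Region</b></label><div>5</div></td>
</tr></table>
"""

soup = BeautifulSoup(sample, "html.parser")

# td:has(label) keeps only cells that contain a <label> somewhere inside them.
# Each matched cell yields exactly two strings: the label, then the value.
pairs = [tuple(td.stripped_strings) for td in soup.select("td:has(label)")]
result = dict(pairs)
print(result)
```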

Or be stricter when selecting the element:

soup.select_one('td:has(label:-soup-contains("Address"))').get_text(':', strip=True)
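
The stricter selector can be wrapped in a small helper, sketched here on the same kind of simplified, hypothetical markup. The name `field_text` is made up for illustration, and the sketch assumes a Soup Sieve version recent enough to support the `:-soup-contains()` pseudo-class. `select_one` returns `None` when nothing matches, so the helper checks for that before calling `get_text`.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup, not the HTML from the question.
sample = """
<table><tr>
  <td><label><b>Address</b></label><div>1183 Pelham Wood Dr</div></td>
  <td><label><b>Dispatch Region</b></label><div>5</div></td>
</tr></table>
"""
soup = BeautifulSoup(sample, "html.parser")


def field_text(soup, label_text):
    """Return 'label:value' for the first cell whose label contains label_text, else None."""
    cell = soup.select_one(f'td:has(label:-soup-contains("{label_text}"))')
    if cell is None:
        return None
    # Join the cell's stripped text nodes with ':' between them.
    return cell.get_text(":", strip=True)


address = field_text(soup, "Address")
missing = field_text(soup, "Postcode")
print(address)
```

Note that `:-soup-contains()` does a substring match, so a label such as "Billing Address" would also match "Address".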

EDIT

Based on the comments on the question and the changed input, use .find_next() instead:

dict((e.text.strip(': '),' '.join(e.find_next('td').text.split())) for e in soup.select('label'))
Example
from bs4 import BeautifulSoup

html = '''
<div class="sectionDiv expanded"><div class="Title expanded ToggleSection shead" style="margin-top:1em" id="sect_s1Header"><span class="sectionTitle">Issue details:</span></div>
<hr /><div><!--The div around the table is so that the toggling can be animated smoothly--><table id="sect_s1" class="formSection LabelsAbove"><tr class="formRow "><td id="tdl_8" class="label lc" ><label class="fieldLabel " ><b >Address</b></label> <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;"> <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;"><div id="tdf_8" class="cell cc" ><a href="https://maps.google.com/?q=1183+Pelham+Wood+Dr%2C+Rock+Hill%2C+SC+29732">1183
Pelham Wood Dr, Rock Hill, SC 29732</a>
</span></div></td></tr></table></td><td id="tdl_9" class="label lc" colspan=100>
<label class="fieldLabel " ><b >Dispatch Region</b></label>
 <table class="EmailFieldPadder" border="0" cellspacing="0" cellpadding="0" valign="top" style="width:98%;margin-top:.3em;margin-right:1.5em;">
 <tr><td class="EmailDivWrapper" style="background-color:#f5f5f5;padding: 0.83em;border-radius:3px;margin:0;border:0px;">
    <div id="tdf_9" class="cell cc nowrap" >5</span></div>
</td></tr></table></td></tr><div><!--The div around the table is so that the toggling can be animated smoothly-->
<table id="sect_s19" class="formSection LabelsLeft">
<tr class="formRow ">
<td id="tdl_52" class="label lc" style="vertical-align:top; border:0 solid white; border-bottom-width: .83em; padding: 0.83em 0; border-right-width: .83em;"><label class="fieldLabel" ><b>Preference 1: </b></label></td>
<td id="tdf_52"  class="cell cc nowrap" style="background-color:#f5f5f5; border: solid white; border-top-width:0;border-right-width: 5.455em; border-left-width:.909em; vertical-align:top; border-bottom-width: .909em; padding: .83em;" >Friday, 02-03</td>'''
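# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# Map each label's text to the whitespace-normalized text of the next <td> that follows it
result = dict((e.text.strip(': '),' '.join(e.find_next('td').text.split())) for e in soup.select('label'))
print(result)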
Output
{'Address': '1183 Pelham Wood Dr, Rock Hill, SC 29732',
 'Dispatch Region': '5',
 'Preference 1': 'Friday, 02-03'}
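
The question also asks for a loop over each "sectionDiv expanded" div and a single formatted line per section. One way to sketch that, on simplified, hypothetical markup and with a made-up helper name `section_fields`, is to apply the same `.find_next('td')` idea per section. Keep in mind that `find_next` searches the rest of the document, not just the current section, so this assumes every label is followed by its own value cell.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup with both layouts from the question:
# a value nested inside the label's cell, and a value in the sibling cell.
sample = """
<div class="sectionDiv expanded">
  <table><tr>
    <td><label class="fieldLabel"><b>Address</b></label>
      <table><tr><td><div>1183
Pelham Wood Dr</div></td></tr></table>
    </td>
  </tr></table>
  <table><tr>
    <td><label class="fieldLabel"><b>Preference 1: </b></label></td>
    <td>Friday, 02-03</td>
  </tr></table>
</div>
"""


def section_fields(section):
    """Map each label in the section to the text of the next <td> after it."""
    fields = {}
    for label in section.select("label"):
        value_cell = label.find_next("td")
        if value_cell is None:
            continue  # label with no following cell: skip it
        key = label.get_text().strip(": ")
        # Collapse newlines and repeated spaces inside the value.
        fields[key] = " ".join(value_cell.get_text().split())
    return fields


soup = BeautifulSoup(sample, "html.parser")
lines = []
for section in soup.select("div.sectionDiv.expanded"):
    fields = section_fields(section)
    lines.append("; ".join(f"{k}: {v}" for k, v in fields.items()))
print("\n".join(lines))
```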
