pandas: how to iterate over elements and scrape data into a DataFrame?

Asked by pftdvrlh on 2022-12-21

I'm trying to understand the best way in Python to loop over data that is not formatted as a table (tr/td).
Sample data:
https://www.nhlpa.com/the-pa/certified-agents?range=A-Z

I'm trying to build a table of name, headshot URL, company, address, and education.
So far I have been trying the following, but can't figure out how to get down into the div that holds the content:

import requests
from bs4 import BeautifulSoup

url = 'https://www.nhlpa.com/the-pa/certified-agents?range=A-Z'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# each agent card sits in a div with class "col-lg-6 agent"
table = soup.find_all('div', attrs={'class': 'col-lg-6 agent'})
for a in table:
    if a.find('div', attrs={'headshot'}):
        headshot_url = a.find('div', attrs={'headshot'}).img
Answer 1 (fhg3lkii)

Just iterate over all the agents, select the specific pieces of information, and store them in a list of dicts:

data = []
for e in soup.select('.agent'):
    data.append({
        'name':e.h3.get_text(strip=True).replace('\xa0',' '),
        'headshot_url':e.img.get('src'),
        'company':e.h5.get_text(strip=True),
        'address':e.address.get_text(strip=True) if e.address else None,
        'education':e.select_one('.education div+div').get_text(strip=True)
    })

This can then be converted into a DataFrame.

Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nhlpa.com/the-pa/certified-agents?range=A-Z'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

data = []

for e in soup.select('.agent'):
    data.append({
        'name':e.h3.get_text(strip=True).replace('\xa0',' '),
        'headshot_url':e.img.get('src'),
        'company':e.h5.get_text(strip=True),
        'address':e.address.get_text(strip=True) if e.address else None,
        'education':e.select_one('.education div+div').get_text(strip=True)
    })
pd.DataFrame(data)
Output

| | name | headshot_url | company | address | education |
| --- | --- | --- | --- | --- | --- |
| 0 | Wade Arnott | https://cdn.nhlpa.com/img/assets/agents/headshots/48x48/9207.jpg | Newport Sports Management Inc. | 201 City Centre Drive, Suite 400, Mississauga, Ontario, Canada, L5B 2T4 | Concord Law School, Juris Doctor. Wilfrid Laurier University, Honours Business Administration, 1991. |
| 1 | Patrik Aronsson | https://cdn.nhlpa.com/img/assets/agents/headshots/48x48/56469.jpg | AC Hockey | Faktorvagen 17, Kungsbacka, Sweden, 43437 | None. |
| 2 | Shumi Babaev | https://cdn.nhlpa.com/img/assets/agents/headshots/48x48/56794.jpg | Shumi Babaev Agency | | Moscow Mining University (Moscow), 1989-1994, Master's degree |
| 3 | Mika Beckman | https://cdn.nhlpa.com/img/assets/agents/headshots/48x48/58054.jpg | WSG Finland Ltd | Kappelikuja 6 C, 02200 Espoo, Finland | University of Helsinki, Faculty of Law (1992-1998), Master of Laws |
...
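
Note that select_one('.education div+div') returns None for an agent card that has no education block, and calling get_text() on None raises an AttributeError. Below is a more defensive sketch of the same loop (my addition, assuming soup and data are set up as in the example above), which guards each optional field:

for e in soup.select('.agent'):
    edu = e.select_one('.education div+div')  # may be None for some agents
    data.append({
        'name': e.h3.get_text(strip=True).replace('\xa0', ' ') if e.h3 else None,
        'headshot_url': e.img.get('src') if e.img else None,
        'company': e.h5.get_text(strip=True) if e.h5 else None,
        'address': e.address.get_text(strip=True) if e.address else None,
        'education': edu.get_text(strip=True) if edu else None,
    })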

Answer 2 (4c8rllxm)

Hope this helps <3

import requests
from bs4 import BeautifulSoup

url = 'https://www.nhlpa.com/the-pa/certified-agents?range=A-Z'
r = requests.get(url)
print("fetched")
soup = BeautifulSoup(r.text, 'html.parser')

data = []
table = soup.find_all('div', attrs={'class': 'col-lg-6 agent'})
for a in table:
    headshots = a.find('div', attrs={'headshot'})
    # find the div with the headshot class
    if headshots:
        # check that it is not None
        headshot_url = headshots.img["src"]
        # get the url
    else:
        headshot_url = None
        # so nothing goes wrong with our data set

    content = a.find('div', attrs={'content'})
    # find the div with the content class
    if content:
        # check that the div actually exists
        if content.h3:
            name = str(content.h3.contents[0]).replace("\xa0", " ")
        else:
            name = None
        if content.h5:
            company = content.h5.contents[0]
        else:
            company = None
        html_address = content.address
    else:
        name, company, html_address = None, None, None
        # if content is None, all of these default to None

    if html_address:
        address = html_address.contents[0]
        # you might want to refine this
    else:
        address = None

    education_div = a.find("div", attrs={'education'})
    # find the div with the education class, then its first unclassed child div
    edu = education_div.find("div", attrs={"class": None}) if education_div else None

    if edu:
        education = edu.contents[0]
    else:
        education = None

    # the final record for this agent:
    data_set = {"headshot_url": headshot_url, "name": name, "company": company, 'address': address, 'education': education}
    data.append(data_set)
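
Since the goal is a DataFrame, the per-agent dicts collected in data can be handed to pandas after the loop. A minimal sketch of that last step:

import pandas as pd

# one row per agent; the columns come from the dict keys
df = pd.DataFrame(data)
print(df.head())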
