pandas: Web scraping with Python, getting the href from an "a" element

fhg3lkii posted in Python on 2023-02-27

With the code below, I can get all the data from a given number of pages of the URL:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
    
out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')

But I also need the athletes' codes (field "Competitor").
How can I insert a field containing each competitor's href?

ycl3bljg

I'm not really sure why you do everything the way you do in your code, but to get the table on that page from the link and add a column with the competitors' codes, this is what I would do (in this example just for the first page, but obviously you can extend it):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]

# we need BeautifulSoup to extract the codes from the href attributes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1]
         for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# we then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)

You should now have a new column next to Competitor, and you can drop any columns you don't need, add others, and so on.
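The extraction step can be sanity-checked offline against a minimal HTML fragment that mimics the page's table markup (the rows, athlete names, and codes below are made up; the real table has far more rows and columns):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup as bs

# Hypothetical fragment mimicking the structure the selector targets.
html = """
<table class="records-table">
  <thead><tr><th>Rank</th><th>Competitor</th></tr></thead>
  <tbody>
    <tr><td data-th="Rank">1</td>
        <td data-th="Competitor"><a href="/athletes/a?athlete=14193292">Athlete A</a></td></tr>
    <tr><td data-th="Rank">2</td>
        <td data-th="Competitor"><a href="/athletes/b?athlete=14536762">Athlete B</a></td></tr>
  </tbody>
</table>
"""

# Same two steps as above: read the table with pandas, then pull the
# codes out of the href attributes with BeautifulSoup.
sub_df = pd.read_html(StringIO(html))[0]
soup = bs(html, "html.parser")
codes = [a["href"].split("=")[1]
         for a in soup.select('table.records-table td[data-th="Competitor"] a')]

sub_df.insert(1, "Code", codes)
print(codes)                 # ['14193292', '14536762']
print(list(sub_df.columns))  # ['Rank', 'Code', 'Competitor']
```

Because the number of extracted codes always matches the number of table rows scraped from the same HTML, `insert` lines the codes up with the right athletes.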
