pandas: Web scraping with Python, getting the href from an "a" element

fhg3lkii posted in Python on 2023-02-27

With the code below, I can get all the data from a given number of pages of the URL:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
    
out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')

But I also need the athletes' codes (field "Competitor").
How can I insert a field containing each competitor's href?

ycl3bljg

I'm not really sure why you do everything the way you do in your code, but to get the table on that page from the link and add a column with the competitors' codes, this is what I would do (in this example just for the first page, but obviously you can extend it):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]

# we need BeautifulSoup to extract the codes from the href attributes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1]
         for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# we then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)

You should now have a new column next to Competitor, and you can drop any columns you don't need, add others, and so on.
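The extraction step can be sanity-checked offline against a minimal HTML fragment that mimics the page's table markup (the rows, athlete names, and codes below are made up; the real table has far more rows and columns):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup as bs

# Hypothetical fragment mimicking the structure the selector targets.
html = """
<table class="records-table">
  <thead><tr><th>Rank</th><th>Competitor</th></tr></thead>
  <tbody>
    <tr><td data-th="Rank">1</td>
        <td data-th="Competitor"><a href="/athletes/a?athlete=14193292">Athlete A</a></td></tr>
    <tr><td data-th="Rank">2</td>
        <td data-th="Competitor"><a href="/athletes/b?athlete=14536762">Athlete B</a></td></tr>
  </tbody>
</table>
"""

# Same two steps as above: read the table with pandas, then pull the
# codes out of the href attributes with BeautifulSoup.
sub_df = pd.read_html(StringIO(html))[0]
soup = bs(html, "html.parser")
codes = [a["href"].split("=")[1]
         for a in soup.select('table.records-table td[data-th="Competitor"] a')]

sub_df.insert(1, "Code", codes)
print(codes)                 # ['14193292', '14536762']
print(list(sub_df.columns))  # ['Rank', 'Code', 'Competitor']
```

Because the number of extracted codes always matches the number of table rows scraped from the same HTML, `insert` lines the codes up with the right athletes.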
