pandas: scraper does not load matches that have not been played yet

Posted by osh3o9ms on 2023-03-11 in Other

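I'll try to keep this as concise as possible. I'm scraping football data from fbref and I'm having trouble loading matches that have not been played yet into my dataset for ML. An example of what I want is under the "Scores & Fixtures" table.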
https://fbref.com/en/squads/18bb7c10/Arsenal-Stats
When I run the code, I can scrape every match played so far this season, but not the upcoming matches I'm trying to predict.

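import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# `years`, `standings_url`, and `all_matches` are defined earlier in the script (not shown).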

for year in years:
    data = requests.get(standings_url)
    soup = BeautifulSoup(data.text)
    standings_table = soup.select('table.stats_table')[0]

    links = [l.get("href") for l in standings_table.find_all('a')]
    links = [l for l in links if '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]
    
    previous_season = soup.select("a.prev")[0].get("href")
    standings_url = f"https://fbref.com{previous_season}"
    
    for team_url in team_urls:
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")
        data = requests.get(team_url)
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]
        soup = BeautifulSoup(data.text)
        links = [l.get("href") for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel()
        try:
            team_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
        except ValueError:
            continue
        team_data = team_data[team_data["Comp"] == "Premier League"]
        
        team_data["Season"] = year
        team_data["Team"] = team_name
        all_matches.append(team_data)
        time.sleep(10)

I have tried adjusting the code to cover a wider date range, and I have tried removing the header rows that separate the matches.

Answer #1 (oyjwcjzk)

For some strange reason I couldn't do this with BeautifulSoup, but it works with lxml and XPath, even with the same parser...
To get the "Scores & Fixtures" table, try the following:

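from io import StringIO

from lxml import html as lh
import pandas as pd
import requests

url = "https://fbref.com/en/squads/18bb7c10/Arsenal-Stats"
req = requests.get(url)
doc = lh.fromstring(req.text)
# Select the table inside the "all_matchlogs" div whose caption mentions "Fixtures".
table = doc.xpath('//div[@id="all_matchlogs"][.//caption[contains(.,"Fixtures")]]//table')[0]
# Serialize the element to a str and wrap it in StringIO; newer pandas versions deprecate passing literal HTML.
df = pd.read_html(StringIO(lh.tostring(table, encoding="unicode")))[0]
df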

The output should be the expected table.
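One more thing that may be worth checking in the question's loop is the `matches.merge(..., on="Date")` call. `DataFrame.merge` defaults to an inner join, so if the shooting log only has rows for matches already played (an assumption, not verified against fbref here), the unplayed fixtures would be dropped at that step even when the fixtures table itself was read correctly. The sketch below uses small made-up frames rather than real fbref data to show the difference; the team names and numbers are purely illustrative.

```python
import pandas as pd

# Made-up stand-in for the "Scores & Fixtures" table.
# The last row is an unplayed fixture, so its Result is missing.
matches = pd.DataFrame({
    "Date": ["2023-03-04", "2023-03-12", "2023-03-19"],
    "Opponent": ["Team A", "Team B", "Team C"],
    "Result": ["W", "W", None],
})

# Made-up stand-in for the shooting log.
# ASSUMPTION: it only contains matches that were already played.
shooting = pd.DataFrame({
    "Date": ["2023-03-04", "2023-03-12"],
    "Sh": [20, 14],
    "SoT": [9, 6],
})

# Default how="inner": only dates present in both frames survive.
inner = matches.merge(shooting, on="Date")

# how="left": every fixture is kept; missing shooting stats become NaN.
left = matches.merge(shooting, on="Date", how="left")

# Rows without a result are the fixtures still to be played.
upcoming = left[left["Result"].isna()]

print(len(inner), len(left), len(upcoming))
```

If that assumption holds for the real pages, passing `how="left"` to the merge in the loop would keep the upcoming fixtures, with the shooting columns left empty for them.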
