[html] How do I fix this web-scraping program?

dvtswwa3, posted 2022-12-16

I'm a complete beginner with Python. I wrote this program to scrape an NHL scores website for each game's title and closing odds and write the data to a file. The program runs, but for some reason 2 of the roughly 200 games I tried produce incorrect data.
I think it's because of how I search for a div within a div: the way I wrote the code, only the last div found gets stored, and I assumed that was always the one I wanted to scrape.
Also, I'm sure the way I'm writing to the file is terrible; is there a better way?

import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(response.content, "html.parser")
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1',{'class': "sr-only"})
   
    # Code to find div and search for div within
    divs = soup.find('div',{'class': 'col-sm-4'})
    for tag in divs:
        # find div
        target = tag.find_all("div", {"class","GameDetailsCard__row--3rKYp"})
        for tag in target:
            # find divs within target div
            odds = tag.find("div", {"class","GameDetailsCard__content--2L_KF"})
    # Call write_to_file -> add data scraped from web
    write_to_file(matchtitle.text +" "+ odds.text)

# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    for x in range(26500, 26715):
        #print(x)
        url = link + str(x)
        #print(url)
        get_match_data(url)
        
def write_to_file(game_data):
    file = open("NHL_GAMES.txt","a")
    file.write(game_data +"\n")
    file.close

### Main(void) ?? idk what to call this portion of code not a python savant
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)

Here is a line from the text file with correct data:
Toronto Maple Leafs @ New Jersey Devils on November 24, 2022 NJD -140, o/u 6.5
And here is one with incorrect data:
Carolina Hurricanes @ Anaheim Ducks on December 7, 2022 Justin St. Pierre, Chris Lee
Only 2 out of 215 are wrong like this.
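(Editor's note on the file-writing concern above: a `with` block is the usual fix, since it guarantees the file is closed even if a write fails; the original `file.close` is missing its parentheses, so the file was never explicitly closed at all. A minimal sketch, keeping the original filename and append mode:)

```python
def write_to_file(game_data):
    # "a" appends each game on its own line; the with-block closes
    # the file automatically, even if the write raises an exception
    with open("NHL_GAMES.txt", "a") as file:
        file.write(game_data + "\n")
```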

twh00eeo

It looks like certain NHL game pages (e.g. the Carolina one) don't contain an 'Odds' section, possibly because it was an OT game? Either way, the best approach is to add a clause to handle 'no odds found'. I've updated some of your code below:

import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    results = []
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1',{'class': "sr-only"})
    target = soup.find_all("div", {"class": "GameDetailsCard__row--3rKYp"})

    odds = "No Odds found!"  # default for pages with no Odds section
    for tag in target:
        if "Odds" in str(tag.find("div", {"class": "GameDetailsCard__label--iBMhJ"})):
            odds = str(tag.find("div", {"class": "GameDetailsCard__content--2L_KF"}).text)
            break  # stop here so a later row cannot overwrite the odds

    print(matchtitle.text + " " + odds)
    results.append(matchtitle.text + " " + odds)
    # Call write_to_file -> add data scraped from web
    write_to_file(results)
    
# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    print("Getting game details...")
    for x in range(26500, 26715):
        #print(x)
        url = link + str(x)
        #print(url)
        get_match_data(url)
    
def write_to_file(game_data):
    with open("NHL_GAMES.txt", "a") as file:
        for line in game_data:
            file.write(line + "\n")

### Main(void) ?? idk what to call this portion of code not a python savant
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)
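(Editor's note: the answer's loop can still crash when a game ID in the 26500-26715 range has no page at all, since `matchtitle` comes back `None`. A hardened sketch is below; `parse_match` and `get_match_data_safe` are hypothetical names, the class names are taken from the answer above, and the guard conditions are assumptions about how the site can fail, not anything it documents.)

```python
import requests
from bs4 import BeautifulSoup

def parse_match(html):
    # Extract "title odds" from a game page's HTML, or None if the
    # expected title header is missing (e.g. a nonexistent game ID)
    soup = BeautifulSoup(html, "html.parser")
    matchtitle = soup.find("h1", {"class": "sr-only"})
    if matchtitle is None:
        return None
    odds = "No Odds found!"  # default when the page has no Odds row
    for row in soup.find_all("div", {"class": "GameDetailsCard__row--3rKYp"}):
        label = row.find("div", {"class": "GameDetailsCard__label--iBMhJ"})
        if label is not None and "Odds" in label.text:
            content = row.find("div", {"class": "GameDetailsCard__content--2L_KF"})
            if content is not None:
                odds = content.text
            break  # the Odds row was found; stop scanning
    return matchtitle.text + " " + odds

def get_match_data_safe(url_val):
    # Skip unreachable or missing pages instead of crashing mid-run
    try:
        response = requests.get(url_val, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return parse_match(response.text)
```

Separating the parsing from the HTTP request also makes the scraper testable against a saved HTML snippet without hitting the site.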
