[html] How do I fix this web-scraping program?

dvtswwa3, posted 2022-12-16

I'm a complete beginner with Python. I wrote this program to scrape an NHL scores website for each game's title and closing odds and write the data to a file. The program runs, but for some reason 2 of the roughly 200 games I tried produce incorrect data.
I think it's because of how I search for a div within a div: the way I wrote the code, only the last div found gets stored, and I assumed that was always the one I wanted to scrape.
Also, I'm sure the way I'm writing to the file is terrible; is there a better way?

import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(response.content, "html.parser")
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1',{'class': "sr-only"})
   
    # Code to find div and search for div within
    divs = soup.find('div',{'class': 'col-sm-4'})
    for tag in divs:
        # find div
        target = tag.find_all("div", {"class","GameDetailsCard__row--3rKYp"})
        for tag in target:
            # find divs within target div
            odds = tag.find("div", {"class","GameDetailsCard__content--2L_KF"})
    # Call write_to_file -> add data scraped from web
    write_to_file(matchtitle.text +" "+ odds.text)

# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    for x in range(26500, 26715):
        #print(x)
        url = link + str(x)
        #print(url)
        get_match_data(url)
        
def write_to_file(game_data):
    file = open("NHL_GAMES.txt","a")
    file.write(game_data +"\n")
    file.close

### Main(void) ?? idk what to call this portion of code not a python savant
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)

Here is a line from the text file with correct data:
Toronto Maple Leafs @ New Jersey Devils on November 24, 2022 NJD -140, o/u 6.5
And here is one with incorrect data:
Carolina Hurricanes @ Anaheim Ducks on December 7, 2022 Justin St. Pierre, Chris Lee
Only 2 out of 215 are wrong like this.
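(Editor's note on the file-writing concern above: a `with` block is the usual fix, since it guarantees the file is closed even if a write fails; the original `file.close` is missing its parentheses, so the file was never explicitly closed at all. A minimal sketch, keeping the original filename and append mode:)

```python
def write_to_file(game_data):
    # "a" appends each game on its own line; the with-block closes
    # the file automatically, even if the write raises an exception
    with open("NHL_GAMES.txt", "a") as file:
        file.write(game_data + "\n")
```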

twh00eeo

It looks like certain NHL game pages (e.g. the Carolina one) don't contain an 'Odds' section, possibly because it was an OT game? Either way, the best approach is to add a clause to handle 'no odds found'. I've updated some of your code below:

import requests
from bs4 import BeautifulSoup

# Function to scrape web and find the game title and closing odds
def get_match_data(url_val):
    results = []
    # Set up html parser
    response = requests.get(url_val)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    
    # Scrape for header which is "matchtitle"
    matchtitle = soup.find('h1',{'class': "sr-only"})
    target = soup.find_all("div", {"class": "GameDetailsCard__row--3rKYp"})

    odds = "No Odds found!"  # default for pages with no Odds section
    for tag in target:
        if "Odds" in str(tag.find("div", {"class": "GameDetailsCard__label--iBMhJ"})):
            odds = str(tag.find("div", {"class": "GameDetailsCard__content--2L_KF"}).text)
            break  # stop here so a later row cannot overwrite the odds

    print(matchtitle.text + " " + odds)
    results.append(matchtitle.text + " " + odds)
    # Call write_to_file -> add data scraped from web
    write_to_file(results)
    
# Code to pass multiple urls to scrape for different games
def multi_games_url_handler(link):
    print("Getting game details...")
    for x in range(26500, 26715):
        #print(x)
        url = link + str(x)
        #print(url)
        get_match_data(url)
    
def write_to_file(game_data):
    with open("NHL_GAMES.txt", "a") as file:
        for line in game_data:
            file.write(line + "\n")

### Main(void) ?? idk what to call this portion of code not a python savant
# Fetch the webpage
link = "https://www.thescore.com/nhl/events/"
multi_games_url_handler(link)
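(Editor's note: the answer's loop can still crash when a game ID in the 26500-26715 range has no page at all, since `matchtitle` comes back `None`. A hardened sketch is below; `parse_match` and `get_match_data_safe` are hypothetical names, the class names are taken from the answer above, and the guard conditions are assumptions about how the site can fail, not anything it documents.)

```python
import requests
from bs4 import BeautifulSoup

def parse_match(html):
    # Extract "title odds" from a game page's HTML, or None if the
    # expected title header is missing (e.g. a nonexistent game ID)
    soup = BeautifulSoup(html, "html.parser")
    matchtitle = soup.find("h1", {"class": "sr-only"})
    if matchtitle is None:
        return None
    odds = "No Odds found!"  # default when the page has no Odds row
    for row in soup.find_all("div", {"class": "GameDetailsCard__row--3rKYp"}):
        label = row.find("div", {"class": "GameDetailsCard__label--iBMhJ"})
        if label is not None and "Odds" in label.text:
            content = row.find("div", {"class": "GameDetailsCard__content--2L_KF"})
            if content is not None:
                odds = content.text
            break  # the Odds row was found; stop scanning
    return matchtitle.text + " " + odds

def get_match_data_safe(url_val):
    # Skip unreachable or missing pages instead of crashing mid-run
    try:
        response = requests.get(url_val, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return parse_match(response.text)
```

Separating the parsing from the HTTP request also makes the scraper testable against a saved HTML snippet without hitting the site.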
