Extracting data by XPath from a live football score site with Python Scrapy

yhuiod9q  posted on 2022-11-09  in Python

I am trying to use Scrapy to return the results and statistics of live matches on SofaScore.
Site: https://www.sofascore.com/
Code below:

import scrapy

class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath(
            "/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div"
        ).extract()
        print(time1)

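I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returned nothing. I have tried many different XPaths and none of them return anything. What am I doing wrong?
For example, the first match on the page today (10/06) is France vs. Austria, with XPath: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div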

z31licg0  #1

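The data is generated by JavaScript, and Scrapy does not execute JavaScript, so the elements your XPath targets are not in the HTML it downloads. The same data is available from the site's API.
Open devtools in the browser, click the Network tab, then click the Live button to see where the page loads its data from. Then open the JSON response and look at its structure.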
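To make the "look at its structure" step concrete, here is a minimal sketch that prints an outline of the keys in a parsed JSON payload. The helper name `describe` and the trimmed sample payload are made up for illustration; a real response will contain many more fields, and its exact layout should be checked in devtools.

```python
import json


def describe(node, depth=0, max_depth=3):
    """Return indented lines outlining the keys and value types of parsed JSON."""
    lines = []
    if depth >= max_depth:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only the first element is inspected, assuming list items share a shape.
        lines.extend(describe(node[0], depth, max_depth))
    return lines


# Hypothetical, heavily trimmed payload (not real API output).
sample = json.loads(
    '{"events": [{"tournament": {"name": "Example Cup"}, "time": {"initial": 2700}}]}'
)
outline = describe(sample)
print("\n".join(outline))
```

Saving the response body from the Network tab to a file and passing the result of `json.load` to `describe` gives a quick overview of which keys to read in the spider.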

import scrapy

class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        # response.json() requires Scrapy 2.2 or later
        events = response.json()['events']
        # Iterate through the list and yield the fields you want, for example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
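Since the question asks for match results, the `yield` above could be extended along these lines. This is only a sketch: the keys `homeTeam`, `awayTeam`, `homeScore`, `awayScore` and `current` are assumptions about the JSON layout and should be verified against the actual response in devtools. The helper `summarize_event` and the sample event are hypothetical.

```python
def summarize_event(event):
    """Flatten one event dict into a small result row; missing keys become None."""

    def dig(node, *keys):
        # Walk nested dicts safely instead of raising KeyError on a missing key.
        for key in keys:
            if not isinstance(node, dict):
                return None
            node = node.get(key)
        return node

    return {
        "tournament": dig(event, "tournament", "name"),
        "home": dig(event, "homeTeam", "name"),      # assumed key names
        "away": dig(event, "awayTeam", "name"),      # assumed key names
        "home_score": dig(event, "homeScore", "current"),  # assumed key names
        "away_score": dig(event, "awayScore", "current"),  # assumed key names
    }


# Hypothetical event for illustration only (not real match data).
sample_event = {
    "tournament": {"name": "Example Cup"},
    "homeTeam": {"name": "Home FC"},
    "awayTeam": {"name": "Away FC"},
    "homeScore": {"current": 2},
    "awayScore": {"current": 1},
}
row = summarize_event(sample_event)
print(row)
```

Inside the spider, `yield summarize_event(event)` would replace the dict literal in the loop. Using `.get` rather than direct indexing keeps one incomplete event from stopping the whole crawl.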
