Scrapy: why is the JSON output empty?

Posted by az31mfrm on 2022-12-13 in Other

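I have run into a problem using Scrapy. I created nba.json by running this command in the terminal (scrapy crawl nba -o nba.json), but the JSON file is empty, and I don't know why. Before this, I used the same code with another JSON file and it worked fine. Can anyone help? Thanks in advance!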

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]
    def parse(self, response):
        for content in response.xpath("//*[@id='fittPageContainer']/div[3]/div/div/section[1]/div/div[4]/div[1]/div/div[2]/div/div/div[2]/table/tbody/tr"):
            yield {
                "name" : content.xpath('td[1]/div/a/text()').get(),
                "team" : content.xpath('td[1]/div/span[2]/text()').get(),
                "ppg" : content.xpath('td[2]/text()').get()
            }

        next_page = response.xpath('').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

Answer #1 (plicqrtu)

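Some of the information on that page is rendered by JavaScript, so it is not present in the HTML that Scrapy downloads by default.
You can use the scrapy-playwright plugin to get the rendered content.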
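To illustrate why the original selector yields nothing, here is a minimal sketch using only the standard library. Both markup strings are hypothetical stand-ins (they are not the real ESPN page): one for what a plain HTTP client receives, one for the DOM after a browser has run the page's scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what a plain HTTP client receives before any script runs.
STATIC_HTML = (
    "<html><body><div id='fittPageContainer'></div>"
    "<script>/* builds the table in the browser */</script></body></html>"
)

# Hypothetical markup: the same page after the browser has executed the script.
RENDERED_HTML = """<html><body><div id='fittPageContainer'><table><tbody>
<tr><td>James Harden</td><td>34.3</td></tr>
<tr><td>Bradley Beal</td><td>30.5</td></tr>
</tbody></table></div></body></html>"""


def count_rows(markup):
    """Count table rows that a row selector could match in the given markup."""
    root = ET.fromstring(markup)
    return len(root.findall(".//tbody/tr"))


print(count_rows(STATIC_HTML))    # no rows: the spider's loop body never runs
print(count_rows(RENDERED_HTML))  # rows exist only after rendering
```

When the row selector matches nothing, the for loop in parse never yields an item, which would explain an empty export.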

pip install scrapy-playwright
playwright install

Then, in settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

After that, in your spider you only need to add the playwright key to the request's meta.
For example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]

    def start_requests(self):
        # Send every start URL through Playwright so the JavaScript-built tables are rendered.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        # Each stats section is a div.mb1 with a child div holding the section title.
        for content in response.xpath("//div[@class='mb1']"):
            if content.xpath('./div/text()').get() == "Offensive Leaders":
                for table in content.xpath('.//div[@class="ResponsiveTable mt4"]'):
                    for row in table.xpath('.//tbody/tr'):
                        yield {
                            "name": row.xpath('.//a/text()').getall(),
                            "team": row.xpath('.//span/text()').getall(),
                            "ppg": row.xpath('.//td[@class="Table__TD"]/text()').get()
                        }
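Note that getall() returns a list of strings, while get() returns the first match or None. If single-valued fields are wanted for "name" and "team", one option is to collapse each list to its first non-blank entry. The helper below is a hypothetical illustration, not part of the answer or of Scrapy:

```python
from typing import Iterable, Optional


def first_text(values: Iterable[str]) -> Optional[str]:
    """Return the first non-blank string, stripped, or None if there is none."""
    for value in values:
        stripped = value.strip()
        if stripped:
            return stripped
    return None


print(first_text(["James Harden"]))  # James Harden
print(first_text(["  ", "HOU"]))     # HOU
print(first_text([]))                # None
```

In the spider it could wrap the selector result, for example first_text(row.xpath('.//a/text()').getall()).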

Output:

2022-12-07 17:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.espn.com/nba/stats/_/season/2020/seasontype/2> (referer: https://www.espn.com/) ['playwright']
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'James Harden', 'team': 'HOU', 'ppg': '34.3'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Bradley Beal', 'team': 'WSH', 'ppg': '30.5'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Damian Lillard', 'team': 'POR', 'ppg': '30.0'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Trae Young', 'team': 'ATL', 'ppg': '29.6'}
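To confirm that the export is no longer empty after a crawl, a small standard-library check can help. This is a sketch that assumes the feed was written as a JSON array to nba.json, as in the question; depending on the Scrapy version and feed settings, a crawl that yields no items may leave an empty file or no file at all, so both cases are treated as zero items here.

```python
import json
from pathlib import Path


def load_items(path):
    """Return the exported items as a list; a missing or empty file counts as no items."""
    file = Path(path)
    if not file.exists():
        return []
    text = file.read_text(encoding="utf-8")
    if not text.strip():
        return []
    return json.loads(text)


items = load_items("nba.json")
print(f"{len(items)} items exported")
```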
