Scrapy-Playwright not sending the next request through Scrapy

7vhp5slm · asked 2022-11-09 · Other

I have the following example:

import scrapy
from playwright.async_api import Page

class GreetingsSpider(scrapy.Spider):
    name = "greetings"
    allowed_domains = ["example.com"]

    custom_settings = {
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        url = "https://www.example.com"
        yield scrapy.Request(
            url,
            callback=self.parse,
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        print("Hello ")
        page: Page = response.meta["playwright_page"]
        await page.close()
        print("Hello from parse next")
        yield scrapy.Request(
            response.url,
            callback=self.parse_next,
            meta={"playwright": True},
            errback=self.errback_close_page,
        )
        print("Hello from second parse next")

    async def errback_close_page(self, failure):
        # assumed implementation of the errback referenced above, which the
        # snippet does not show: close the Playwright page if one was attached
        page = failure.request.meta.get("playwright_page")
        if page is not None:
            await page.close()

    def parse_next(self, response):
        print(response.url)

The problem here is that parse_next is never called.
This is the output I get:

Hello 
Hello from parse next
Hello from second parse next
2022-11-03 07:26:03 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-03 07:26:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 193,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 260537,
...
}
2022-11-03 07:26:03 [scrapy.core.engine] INFO: Spider closed (finished)
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing download handler
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing download handler
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing browser

Since parse_next is never called, the print(response.url) line never executes.
Any ideas?
I have followed both the Playwright and Scrapy documentation, but I can't see what I'm missing here.
This is a toy example, and it doesn't seem to work.
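For reference, one likely cause: Scrapy's scheduler filters duplicate requests by default, and the follow-up request targets the same URL (response.url) as the first one, so it is silently dropped; with LOG_LEVEL at "INFO", the dupefilter's DEBUG message is also hidden. A minimal sketch of that fix, assuming the default duplicate filter is indeed the culprit, is to set dont_filter=True on the follow-up request:

async def parse(self, response):
    page: Page = response.meta["playwright_page"]
    await page.close()
    # Same-URL requests are filtered as duplicates by default, so without
    # dont_filter=True this request never reaches parse_next.
    yield scrapy.Request(
        response.url,
        callback=self.parse_next,
        meta={"playwright": True},
        dont_filter=True,
    )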


rt4zxlrg · answer #1

1. The playwright_include_page approach isn't workable here; use playwright_page_methods instead.
2. Select the data elements you want from the web page and use that selector as a page method.
3. The web page I'm using as an example contains four table lists, but my goal is to select the first one. That is what the page method waits for, and I then scrape its data, with the element selection done in def parse(self, response):, as in the example below.

Example:

import scrapy
from scrapy_playwright.page import PageMethod

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield scrapy.Request(
            url="https://info.uniswap.org/#/",
            callback=self.parse,
            meta={
                "playwright": True,
                # wait until the first table list has been rendered by the page's JS
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '((//*[@class="sc-brqgnP klmLHi"])[1]//*[@class="sc-brqgnP klmLHi"])[1]'),
                ],
            },
        )

    def parse(self, response):
        # the same XPath scopes extraction to the first table list
        products = response.xpath('((//*[@class="sc-brqgnP klmLHi"])[1]//*[@class="sc-brqgnP klmLHi"])[1]//div[@class="sc-bXGyLb ePvtyo"]')
        for product in products:
            yield {
                'price': product.xpath('.//*[@class="sc-chPdSV goKJOd sc-bMVAic eOIWzG css-63v6lo"][1]/text()').get(),
            }

Output:

{'price': '$1.00'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.55k'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.00'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$20.28k'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.00'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.00'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.21'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.53k'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$0.60'}
2022-11-03 14:27:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://info.uniswap.org/#/>
{'price': '$1.00'}

**settings.py file:** you must add the following settings to your settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10000
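Note that PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT is expressed in milliseconds, so 10000 gives Playwright a 10-second navigation timeout, and the TWISTED_REACTOR line is required because scrapy-playwright runs on asyncio. With these settings in place, the spider above can be started as usual, e.g. with scrapy crawl test.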
