I have the following example:
import scrapy
from playwright.async_api import Page  # type of response.meta["playwright_page"]


class GreetingsSpider(scrapy.Spider):
    name = "greetings"
    allowed_domains = ["example.com"]
    custom_settings = {
        "LOG_LEVEL": "INFO",
    }

    def start_requests(self):
        url = "https://www.example.com"
        yield scrapy.Request(
            url,
            callback=self.parse,
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        print("Hello ")
        page: Page = response.meta["playwright_page"]
        await page.close()
        print("Hello from parse next")
        yield scrapy.Request(
            response.url,
            callback=self.parse_next,
            meta={"playwright": True},
            errback=self.errback_close_page,
        )
        print("Hello from second parse next")

    def parse_next(self, response):
        print(response.url)
The problem here is that parse_next is never called.
This is the output I get:
Hello
Hello from parse next
Hello from second parse next
2022-11-03 07:26:03 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-03 07:26:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 193,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 260537,
...
}
2022-11-03 07:26:03 [scrapy.core.engine] INFO: Spider closed (finished)
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing download handler
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing download handler
2022-11-03 07:26:03 [scrapy-playwright] INFO: Closing browser
The problem is that parse_next is never called, so the print(response.url) line is never executed.
Any ideas?
I have followed the Playwright and Scrapy documentation, but I can't see what I'm missing here.
This is a toy example, and it doesn't seem to work.
1 Answer
1. The playwright_include_page approach is not usable here; use
playwright_page_methods
instead.
1. Select the data elements you need from the page and pass them as page methods.
1. The page I use here as an example contains 4 tables, but my goal is to select the first one; that is what the page method selects, and the data I want to scrape comes from it. The element selection itself is done in `def parse(self, response):`.
Example:
**settings.py file:** you must add the following requirements to your settings.py file.
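These are the settings the scrapy-playwright README documents for enabling the library (a sketch of what the answer is pointing at; the exact snippet in the original answer was not preserved):

```python
# settings.py -- route HTTP(S) downloads through scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires Twisted's asyncio-based reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```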