I tried to run the infinite-scroll example from the scrapy-playwright docs against quotes.toscrape.com/scroll, but I can't even get it to run because of a reactor problem:
URL SPIDER TEST
***********************
SCRAPE STARTED
***********************
2022-08-11 15:47:38 [scrapy.crawler] INFO: Overridden settings: {'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'} crawled: <Deferred at 0x11ef17e50 current result: <twisted.python.failure.Failure builtins.Exception: The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)>>
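For context on that message: importing twisted.internet.reactor installs Twisted's platform-default reactor as a side effect, and Scrapy then refuses to start when that reactor differs from the one named in TWISTED_REACTOR. A minimal sketch that reproduces the same exception outside the spider (assuming only scrapy and twisted are installed; verify_installed_reactor is the helper Scrapy itself uses for this check):

# Importing the reactor module installs the platform-default reactor
# (SelectReactor in the log above) as a side effect.
from twisted.internet import reactor

from scrapy.utils.reactor import verify_installed_reactor

# Raises the same "installed reactor ... does not match the requested one"
# exception, because a default reactor is already installed.
verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")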
The code:
import csv
import json
import pygsheets
import scrapy
from scrapy_playwright.page import PageMethod
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.utils.reactor import install_reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner
import datetime as dt
from datetime import date
from twisted.internet import reactor, defer
import tempfile


def breaker(comment):
    print('***********************')
    print(comment)
    print('***********************')


class UrlSpider(scrapy.Spider):
    name = "Url"
    # Route http/https through scrapy-playwright and request the asyncio
    # reactor, as its docs require.
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            url='http://quotes.toscrape.com/scroll',
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    # Wait for the first quotes, scroll to the bottom, then
                    # wait for the second batch of quotes to be appended.
                    PageMethod('wait_for_selector', 'div.quote'),
                    PageMethod('evaluate', 'window.scrollBy(0, document.body.scrollHeight)'),
                    PageMethod('wait_for_selector', 'div.quote:nth-child(11)'),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        await page.screenshot(path='quotes.png', full_page=True)
        await page.close()
        return {'quotes_count': len(response.css('div.quote'))}


print('URL SPIDER TEST')
configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)


@defer.inlineCallbacks
def crawl():
    breaker('SCRAPE STARTED')
    bug = runner.crawl(UrlSpider)
    yield bug
    reactor.stop()  # the reactor has stop(), not close(); stop it after the crawl


url_list = crawl()
print('crawled: ' + str(url_list))
reactor.run()
I've been trying to find a solution for hours with no luck. I'm using CrawlerRunner because I want to automate the script at some point, but I get the same error even with CrawlerProcess (sketched below).
I'm also using custom_settings because I ran into problems with the project settings not being picked up via get_project_settings; custom_settings lets me make sure they are actually applied.
If I remove the TWISTED_REACTOR entry from custom_settings, the spider scrapes and yields, but the reactor error comes back and it retrieves nothing.
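This is roughly what the CrawlerProcess attempt looked like (a sketch; it assumes the same UrlSpider class as above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess is documented to install the reactor named in
# TWISTED_REACTOR itself, but this run fails with the same mismatch for me.
process = CrawlerProcess(get_project_settings())
process.crawl(UrlSpider)
process.start()  # blocks until the crawl finishes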
1 Answer
In your settings, look for and comment out the AsyncioSelectorReactor line: TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
That was the last line in my settings file.
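For reference, a sketch of the relevant part of a project settings.py as generated by the Scrapy 2.7 template (the REQUEST_FINGERPRINTER_IMPLEMENTATION line is part of that template; your file may differ), with the reactor line commented out as described:

# settings.py (sketch, based on the Scrapy 2.7 project template)
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"

# Commented out so the project no longer requests a reactor that can conflict
# with whichever reactor the launching script has already installed:
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Note that scrapy-playwright's docs do require the asyncio reactor, so with this line commented out it still needs to be requested somewhere; the spider's custom_settings above already does that.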