I'm working on a practice project that scrapes dynamically loaded content with scrapy-playwright, but I've hit a wall and can't figure out where the problem is. The spider simply refuses to start crawling and gets stuck right after the "Telnet console listening on 127.0.0.1:6023" line.
I set up the project as suggested in the tutorial.
This is the relevant part of my settings.py (I have also tried tweaking other settings, such as CONCURRENT_REQUESTS and COOKIES_ENABLED, to try to fix it, but nothing changed):
import asyncio
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
And this is the spider itself:
import scrapy
from scrapy import Request
from scrapy_playwright.page import PageMethod

class roksh_crawler(scrapy.Spider):
    name = "roksh_crawler"

    def start_requests(self):
        yield Request(
            url="https://www.roksh.com/",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("screenshot", path="example.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        screenshot = response.meta["playwright_page_methods"][0]
        # screenshot.result contains the image's bytes
Taking a screenshot of the page is just a test; nothing else has worked either, so I don't think the screenshot itself is the problem.
This is the log I get:
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: roksh_crawler)
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 21.7.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3, Platform Windows-10-10.0.19045-SP0
2022-11-24 09:54:19 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'roksh_crawler', 'CONCURRENT_REQUESTS': 32, 'NEWSPIDER_MODULE': 'roksh.spiders', 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['roksh.spiders'], 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-24 09:54:19 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet Password: 7aad12ee78cfff92
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled item pipelines: []
2022-11-24 09:54:19 [scrapy.core.engine] INFO: Spider opened
2022-11-24 09:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-24 09:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:58:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:59:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:00:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:01:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:02:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:03:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
This goes on indefinitely.
I've also tried different URLs but got the same result, so I think the problem is on my end rather than the server's. Also, if I run the spider without Playwright (i.e. if I remove DOWNLOAD_HANDLERS from the settings), it works, although it only returns the source HTML, which is not the result I'm after.
1 Answer
Works fine for me. Just remove or comment out these lines in your settings.py file:
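Assuming the answer is referring to the manual reactor setup at the top of settings.py (the `install_reactor` call and the event-loop-policy line), a trimmed settings.py might look like the sketch below. Scrapy installs the reactor itself when the TWISTED_REACTOR setting is present, so installing it manually from inside settings.py is redundant; this is a sketch of that idea, not the answerer's verbatim fix:

```python
# settings.py -- minimal scrapy-playwright configuration (sketch; assumes the
# answer means dropping the manual install_reactor / event-loop-policy calls,
# since Scrapy installs the reactor itself based on TWISTED_REACTOR).

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

With this, `scrapy crawl roksh_crawler` lets Scrapy itself pick and install the asyncio reactor before the crawl starts, instead of settings.py doing it as a side effect of being imported.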