Scrapy-Playwright integration on Windows

pu3pd22g  posted on 2022-11-09  in  Windows

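I am trying to use the scrapy-playwright library to parse/scrape a JavaScript-based website. While working on it, I learned that the library is not compatible with Windows (a known issue). Here is a minimal reproducible example: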

import scrapy
from asyncio.windows_events import *
from scrapy.crawler import CrawlerProcess

class Play1Spider(scrapy.Spider):
    name = 'play1'

    def start_requests(self):
        yield scrapy.Request(
            "http://testphp.vulnweb.com/",
            callback=self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
            },
        )

    async def parse(self, response):
        yield {
            'text': response.text
        }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI":'Products.jl',
            "FEED_FORMAT":'jsonlines',
        }
    )
    process.crawl(Play1Spider)
    process.start()

Here is the error stack trace:

2022-07-12 16:58:42 [scrapy.core.engine] INFO: Spider opened
2022-07-12 16:58:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-12 16:58:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-12 16:58:43 [scrapy-playwright] INFO: Starting download handler
2022-07-12 16:58:43 [scrapy-playwright] INFO: Starting download handler
2022-07-12 16:58:43 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-3' coro=<Connection.run() done, defined at C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py:212> exception=NotImplementedError()>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-4' coro=<Connection.run() done, defined at C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py:212> exception=NotImplementedError()>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ScrapyPlaywrightDownloadHandler._engine_started of <scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler object at 0x000001B089014970>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 127, in _launch
    self.playwright = await self.playwright_context_manager.start()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 51, in start     
    return await self.__aenter__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 46, in __aenter__
    playwright = AsyncPlaywright(next(iter(done)).result())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ScrapyPlaywrightDownloadHandler._engine_started of <scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler object at 0x000001B08964B8E0>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 127, in _launch
    self.playwright = await self.playwright_context_manager.start()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 51, in start     
    return await self.__aenter__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 46, in __aenter__
    playwright = AsyncPlaywright(next(iter(done)).result())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:48 [scrapy.core.scraper] ERROR: Error downloading <GET http://testphp.vulnweb.com/>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\python\failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy\core\downloader\middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 247, in _download_request    
    page = await self._create_page(request)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 185, in _create_page
    context = await self._create_browser_context(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 160, in _create_browser_context
    await self._maybe_launch_browser()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 144, in _maybe_launch_browser
    logger.info(f"Launching browser {self.browser_type.name}")
AttributeError: 'ScrapyPlaywrightDownloadHandler' object has no attribute 'browser_type'
2022-07-12 16:58:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-12 16:58:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.AttributeError': 1,
 'downloader/request_bytes': 229,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 5.260977,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 12, 11, 28, 48, 293797),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 5,
 'log_count/INFO': 12,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 7, 12, 11, 28, 43, 32820)}
2022-07-12 16:58:48 [scrapy.core.engine] INFO: Spider closed (finished)
2022-07-12 16:58:48 [scrapy-playwright] INFO: Closing download handler
2022-07-12 16:58:48 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method DownloadHandlers._close of <scrapy.core.downloader.handlers.DownloadHandlers object at 0x000001B088FDA920>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\python\failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 81, in _close 
    yield dh.close()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 229, in close
    yield deferred_from_coro(self._close())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 237, in _close      
    await self.playwright_context_manager.__aexit__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 54, in __aexit__
    await self._connection.stop_async()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 230, in stop_async
    self._transport.request_stop()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 107, in request_stop
    assert self._output
AttributeError: 'PipeTransport' object has no attribute '_output'

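I have already looked at a similar question (other solution) but did not reach any conclusion. I know it might work on WSL or macOS, but right now I need a solution for Windows. I am looking for any suggestions or solutions from anyone who has run into a similar problem. I am also open to trying other libraries, if there are any.
P.S.: I have already gone through Selenium, scrapy-puppeteer (similar issue), and scrapy-splash.
Looking forward to hearing some suggestions and feedback. TIA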

23c0lvtd1#

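The Windows implementation of asyncio can use two event loop implementations: SelectorEventLoop, the default before Python 3.8, which is required when using Twisted; and ProactorEventLoop, the default since Python 3.8, which cannot work with Twisted.
So on Python 3.8+ the event loop class needs to be changed.
Changed in version 2.6.0: the event loop class is changed automatically when you change the TWISTED_REACTOR setting or call install_reactor().
To change the event loop class manually, call the following code before installing the reactor: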

import asyncio
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

You can put this in the same function that installs the reactor, if you do that yourself, or in some code that runs before the reactor is installed, e.g. settings.py.
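As a rough sketch of the settings.py placement mentioned above, assuming a project-style layout: the `sys.platform` guard is my own addition so the same file still imports on Linux/macOS, and the reactor and download-handler values are copied from the question's `CrawlerProcess` settings. I have not verified that this alone makes scrapy-playwright start on Windows.

```python
# settings.py (sketch; the platform guard is an assumption, not part of the quoted docs)
import asyncio
import sys

# Only Windows ships WindowsSelectorEventLoopPolicy, so guard the call.
# This must run before the Twisted reactor is installed.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# Values below are taken from the settings dict in the question.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
```

When running from a standalone script with `CrawlerProcess`, as in the question, the same guarded call would go at the top of the script, before `CrawlerProcess(...)` is created.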
Documentation: the Windows implementation of asyncio
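For diagnosis, it can help to check which event loop class asyncio would create under the current policy. The traceback in the question fails inside `_make_subprocess_transport`, and according to the Python asyncio documentation only ProactorEventLoop supports subprocesses on Windows, so the active loop class matters here. The helper name `default_event_loop_name` below is hypothetical; it uses only the standard `asyncio.new_event_loop()` call.

```python
import asyncio


def default_event_loop_name() -> str:
    """Return the class name of the loop the current policy would create."""
    loop = asyncio.new_event_loop()
    try:
        return type(loop).__name__
    finally:
        # Close the throwaway loop so it does not leak resources.
        loop.close()


if __name__ == "__main__":
    # On Windows, expect a name containing "Proactor" or "Selector",
    # depending on the Python version and the policy in effect.
    print(default_event_loop_name())
```

Running it once before and once after the `set_event_loop_policy` call should show whether the policy change actually took effect in that process.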
