How to run Playwright asynchronously in Scrapy (Python)?

q8l4jmvw posted on 2023-06-06 in Python

My sample code works synchronously without errors: it parses all of the start URLs and prints the desired text to the output.
How can I make this code work asynchronously? (I need to run Playwright by calling the function call_playwright(), not by yielding requests.)

import scrapy
from scrapy.crawler import CrawlerProcess
from playwright.sync_api import sync_playwright

class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/?1',
        'https://example.com/?2',
        'https://example.com/?3',
    ]

    def parse(self, response):
        link = response.url
        self.call_playwright(link)

    
    def call_playwright(self, url):
        with sync_playwright() as playwright:
            browser = playwright.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            page.wait_for_selector('h1', timeout=5000)
            tag = page.locator('h1')
            print(tag.inner_text())
            browser.close()
        
        
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ToscrapeSpider)
    process.start()

Here is my version using Playwright's async API:

import scrapy
from scrapy.crawler import CrawlerProcess
from playwright.async_api import async_playwright

class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/?1',
        'https://example.com/?2',
        'https://example.com/?3',
    ]

    async def parse(self, response):
        link = response.url
        await self.call_playwright(link)

    
    async def call_playwright(self, url):
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url)
            await page.wait_for_selector('h1', timeout=5000)
            tag = page.locator('h1')
            print(await tag.inner_text())
            await browser.close()
        
        
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ToscrapeSpider)
    process.start()

When I try to run it, I get this error:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks   
    result = context.run(gen.send, result)
  File "h:\spiders\getprices.py", line 21, in parse
    await self.call_playwright(link)
  File "h:\spiders\getprices.py", line 25, in call_playwright
    async with async_playwright() as playwright:
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\async_api\_context_manager.py", line 31, in __aenter__
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop

Can you help me with this? Python 3.10.7, Scrapy 2.8.0, Playwright 1.34.0

goqiplq2 1#

The kind of pattern I use for asyncio looks like this:

import asyncio

async def main():
    ''' some async code '''

    return


if __name__ == "__main__":

    # on Python 3.7+, asyncio.run(main()) is the modern equivalent
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
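The traceback makes the cause visible: `async_playwright()` calls `asyncio.get_running_loop()`, which only succeeds while an event loop is already running, and Scrapy's Twisted reactor does not start one by default. A minimal stdlib sketch of the distinction (no Scrapy or Playwright involved):

```python
import asyncio

async def main():
    # Inside a coroutine driven by asyncio.run() there IS a running loop,
    # so asyncio.get_running_loop() succeeds. Calling it from plain
    # synchronous code instead raises "RuntimeError: no running event loop",
    # which is exactly the error in the traceback above.
    loop = asyncio.get_running_loop()
    return loop.is_running()

if __name__ == "__main__":
    # asyncio.run() creates a fresh event loop, runs main() to
    # completion, and closes the loop afterwards.
    print(asyncio.run(main()))  # → True
```

For Scrapy specifically, the documented route is its asyncio mode: set `TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"` (in the project settings, or in the settings dict passed to `CrawlerProcess`), which runs Twisted on top of an asyncio loop so that `async def parse` callbacks can await asyncio-based libraries such as `playwright.async_api`.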
