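My example code works synchronously, without errors. It parses all of the start URLs and prints the required text to the output.
How can I make this code work asynchronously? (I need to run Playwright by calling the function call_playwright(), rather than by yielding.)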
import scrapy
from scrapy.crawler import CrawlerProcess
from playwright.sync_api import sync_playwright


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/?1',
        'https://example.com/?2',
        'https://example.com/?3',
    ]

    def parse(self, response):
        link = response.url
        self.call_playwright(link)

    def call_playwright(self, url):
        # Launch a fresh browser for each URL and print the page's <h1> text.
        with sync_playwright() as playwright:
            browser = playwright.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            page.wait_for_selector('h1', timeout=5000)
            tag = page.locator('h1')
            print(tag.inner_text())
            browser.close()


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ToscrapeSpider)
    process.start()
This is my version that uses Playwright asynchronously:
import scrapy
from scrapy.crawler import CrawlerProcess
from playwright.async_api import async_playwright


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/?1',
        'https://example.com/?2',
        'https://example.com/?3',
    ]

    async def parse(self, response):
        link = response.url
        await self.call_playwright(link)

    async def call_playwright(self, url):
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url)
            await page.wait_for_selector('h1', timeout=5000)
            tag = page.locator('h1')
            print(tag.inner_text())
            await browser.close()


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ToscrapeSpider)
    process.start()
When I try to run it, I get an error:
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "h:\spiders\getprices.py", line 21, in parse
    await self.call_playwright(link)
  File "h:\spiders\getprices.py", line 25, in call_playwright
    async with async_playwright() as playwright:
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\async_api\_context_manager.py", line 31, in __aenter__
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop
Can you help me? Python 3.10.7, Scrapy 2.8.0, Playwright 1.34.0
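For context on the RuntimeError above: asyncio.get_running_loop() raises it whenever no asyncio event loop is running in the current thread, and that is the call that fails inside async_playwright() in the traceback. Below is a minimal, stdlib-only sketch of that behavior. The helper names loop_is_running and check_inside_coroutine are made up for illustration and are not part of Scrapy or Playwright.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if called while an asyncio event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Same error type as in the traceback: no loop is running here.
        return False
    return True


async def check_inside_coroutine() -> bool:
    # Executed by asyncio.run(), so an event loop is running at this point.
    return loop_is_running()


# Called from plain synchronous code: no asyncio loop is running.
outside = loop_is_running()

# Called from a coroutine driven by asyncio.run(): a loop is running.
inside = asyncio.run(check_inside_coroutine())

print("outside a loop:", outside)
print("inside a loop:", inside)
```

If this is what is happening here, one thing to check would be whether Scrapy is running on its asyncio reactor (the TWISTED_REACTOR setting with twisted.internet.asyncioreactor.AsyncioSelectorReactor), since CrawlerProcess() is created without any settings in the code above. That is an assumption about the cause, not something the traceback alone confirms.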
1 Answer
The kind of pattern I use for asyncio looks like this: