Scraping a dynamic website with Scrapy Playwright

Posted by 5cg8jx4n on 2022-12-23 in Other

I am trying to scrape a JavaScript-rendered website with scrapy-playwright, but the log shows "Crawled 0 pages". Is there a mistake in my code? Why is no data being crawled? This is the page link: https://www.coursera.org/search?query=python&utm_source=gg&utm_medium=sem&utm_campaign=B2C_INDIA__branded_FTCOF_courseraplus_arte_monthly&utm_content=B2C&campaignid=18216928761&adgroupid=141296026472&device=c&keyword=coursera%20online&matchtype=b&network=g&devicemodel=&adpostion=&creativeid=619458216863&gclid=CjwKCAiAkfucBhBBEiwAFjbkr5EhIFModjG1bK9jcqv126-AOgp4M-DzZCXXwLJyy_e16UZkmoUuxRoC_IcQAvD_BwE

import scrapy
from scrapy_playwright.page import PageMethod

class TestSpider(scrapy.Spider):
    name = 'sample'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.coursera.org/search?query=python&utm_source=gg&utm_medium=sem&utm_campaign=B2C_INDIA__branded_FTCOF_courseraplus_arte_monthly&utm_content=B2C&campaignid=18216928761&adgroupid=141296026472&device=c&keyword=coursera%20online&matchtype=b&network=g&devicemodel=&adpostion=&creativeid=619458216863&gclid=CjwKCAiAkfucBhBBEiwAFjbkr5EhIFModjG1bK9jcqv126-AOgp4M-DzZCXXwLJyy_e16UZkmoUuxRoC_IcQAvD_BwE",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "ul.cds-71"),
                ],
            },
        )
   
        
    def parse(self, response):
        # Yield the HTML of the page as rendered by Playwright.
        yield {
            'text': response.text,
        }

zazmityj #1

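If you are on Windows, you cannot run scrapy-playwright directly. To use it, you have to set up WSL on your Windows machine and run the spider from there. See this issue for details:
https://github.com/scrapy-plugins/scrapy-playwright/issues/7
For how to launch the browser under WSL, see https://github.com/scrapy-plugins/scrapy-playwright/issues/78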
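
As a rough sketch of what this answer implies, a project could refuse to enable the Playwright download handler when it detects native Windows, and otherwise return the settings that scrapy-playwright needs. The handler path and reactor string are the ones given in the scrapy-playwright README. The helper names `running_on_native_windows` and `playwright_settings` are hypothetical, and the hard failure on Windows is an assumption based on the issues linked above, not something the library does itself.

```python
import sys

# Download handler path as documented in the scrapy-playwright README.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"


def running_on_native_windows(platform: str = sys.platform) -> bool:
    """Return True on native Windows. Python inside WSL reports 'linux'."""
    return platform == "win32"


def playwright_settings(platform: str = sys.platform) -> dict:
    """Build the Scrapy settings that route requests through Playwright.

    Hypothetical helper: raising on Windows is an assumption taken from
    the answer above, which recommends running under WSL instead.
    """
    if running_on_native_windows(platform):
        raise RuntimeError(
            "Native Windows detected; run the spider inside WSL instead."
        )
    return {
        "DOWNLOAD_HANDLERS": {
            "http": PLAYWRIGHT_HANDLER,
            "https": PLAYWRIGHT_HANDLER,
        },
        # scrapy-playwright requires the asyncio-based Twisted reactor.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
```

In the spider from the question, this could be wired in with `custom_settings = playwright_settings()` as a class attribute on `TestSpider`, so the check runs before any request is scheduled.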
