How can I get past Cloudflare restrictions using Scrapy?

a14dhokn  posted on 2023-03-18  in  Other

The script below ends up returning 'Attention Required! | Cloudflare' when I use response.css('title::text').get() as a test to fetch data from the page.

Scrapy spider:

import scrapy

class DataSpider(scrapy.Spider):
    name = "avvo"

    def start_requests(self):
        urls = [
            'https://www.avvo.com/attorneys/84025-ut-jason-hunter-284784/reviews.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

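    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup (name, spider arguments)
        super().__init__(*args, **kwargs)
        self.called = False
        self.data = {}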
        
    def parse(self, response):
        if not self.called:
            self.called = True

            self.data["website"] = response.css('title::text').get()
            
            yield self.data

Result:

'Attention Required! | Cloudflare'

y1aodyip  #1

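You can use Selenium (through the scrapy-selenium package) to load the page in a real browser, which may get you past the Cloudflare check.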

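import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest


class DataSpider(scrapy.Spider):
    name = "avvo"

    def start_requests(self):
        # The driver path and options for Selenium are configured in the settings file
        yield SeleniumRequest(
            url='http://example.com',
            wait_time=3,
            callback=self.parse,
        )

    def parse(self, response):
        # Get the Selenium web driver from the response object
        driver = response.meta['driver']

        # Build a selector from the page source as rendered by the web driver
        page_html = driver.page_source
        page_selector = Selector(text=page_html)

        # Same test as in the question: read the page title
        yield {"website": page_selector.css('title::text').get()}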
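
The comment above says the driver path and options live in the settings file, so here is a rough sketch of what that part of settings.py might look like. The setting names are the ones documented by scrapy-selenium; the choice of Firefox with geckodriver and the headless flag are assumptions on my side, so adjust them to whatever browser and driver you have installed.

```python
# settings.py (sketch only; browser and driver choice are assumptions)
from shutil import which

# Browser that scrapy-selenium should drive; "firefox" pairs with geckodriver
SELENIUM_DRIVER_NAME = "firefox"

# Look the driver binary up on PATH; this is None if geckodriver is not installed
SELENIUM_DRIVER_EXECUTABLE_PATH = which("geckodriver")

# Command-line arguments passed to the browser; "-headless" hides the window
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]

# Enable the middleware that turns SeleniumRequest objects into browser page loads
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```

If the title still comes back as 'Attention Required! | Cloudflare', try dropping the headless flag or raising wait_time, since the challenge page may need a visible browser or more time before the real page loads.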
