How to use cloudscraper with Scrapy

whitzsjs · asked 11 months ago · in: Other

I am trying to scrape data from a website using Scrapy, but the site is protected by Cloudflare. I found a workaround using cloudscraper, and cloudscraper does bypass the protection, but I don't understand how to use it together with Scrapy.
I tried to write something like this:

import scrapy
from scrapy.xlib.pydispatch import dispatcher
import cloudscraper
import requests
from scrapy.http import Request, FormRequest


class PycoderSpider(scrapy.Spider):
    name = 'armata_exper'
    start_urls = ['https://arma-models.ru/catalog/sbornye_modeli/?limit=48']

    def start_requests(self):
        url = "https://arma-models.ru/catalog/sbornye_modeli/?limit=48"
        scraper = cloudscraper.CloudScraper()
        cookie_value, user_agent = scraper.get_tokens(url)
        yield scrapy.Request(url, cookies=cookie_value, headers={'User-Agent': user_agent})

    def parse(self, response):
        ...

I get this error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/usr/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'Response' object has no attribute 'meta'
Unhandled Error

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/lib/python3.6/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'Response' object has no attribute 'dont_filter'


Please tell me the right way to do this.

brtdzjyr 1#

I have successfully integrated Scrapy and cloudscraper using Scrapy downloader middlewares.
Here is the middleware I came up with:

import cloudscraper
from scrapy.http import HtmlResponse


class CustomCloudflareMiddleware:

    cloudflare_scraper = cloudscraper.create_scraper()

    def process_response(self, request, response, spider):
        # Pass non-Cloudflare responses through unchanged
        if response.status not in (403, 503):
            return response

        spider.logger.info("Cloudflare detected. Using cloudscraper on URL: %s", request.url)
        # Re-issue the same request through cloudscraper, which solves the challenge
        cflare_response = self.cloudflare_scraper.get(request.url)
        # Convert the requests-style response into a Scrapy HtmlResponse
        return HtmlResponse(url=request.url, body=cflare_response.text, encoding='utf-8')

I use the process_response middleware method. If I detect that the response status is 403 or 503, I repeat the same request with cloudscraper; otherwise I just continue the normal pipeline. (For simplicity you could also drop the if and always use cloudscraper, or define a more precise condition for when to use it.) Also, since requests responses differ from Scrapy's, we need to convert them into Scrapy responses.
Finally, you have to configure the middleware in your spider. I like to do this by defining the custom_settings class variable:
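One possible refinement (a sketch I am adding, not part of the answer above): each time cloudscraper solves a challenge, its session holds the Cloudflare clearance cookies, so caching them per domain would let a process_request hook attach them to later requests instead of re-solving the challenge on every 403/503. The `CookieCache` class and `domain_of` helper below are hypothetical names, not cloudscraper or Scrapy APIs:

```python
from urllib.parse import urlsplit


def domain_of(url):
    # Hostname without the port, e.g. "arma-models.ru"
    return urlsplit(url).netloc.split(':')[0]


class CookieCache:
    """Hypothetical helper: remembers the latest clearance cookies per domain."""

    def __init__(self):
        self._cookies = {}

    def update(self, url, cookies):
        # `cookies` is any name -> value mapping,
        # e.g. self.cloudflare_scraper.cookies.get_dict()
        self._cookies[domain_of(url)] = dict(cookies)

    def get(self, url):
        # Return a copy so callers cannot mutate the cache
        return dict(self._cookies.get(domain_of(url), {}))
```

In the middleware, calling `cache.update(request.url, self.cloudflare_scraper.cookies.get_dict())` after a successful cloudscraper fetch, and adding a `process_request` that fills `request.cookies` from `cache.get(request.url)`, would wire this in (`get_dict()` is a real method of the requests cookie jar that cloudscraper sessions use).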

class MyCrawler(CrawlSpider):
    name = 'mycrawlername'

    custom_settings = {
        'USER_AGENT': '...',
        'CLOSESPIDER_PAGECOUNT': 20,
        'DOWNLOADER_MIDDLEWARES': {
            'middlewares.CustomCloudflareMiddleware.CustomCloudflareMiddleware': 543,
        },
    }

    # my rules...
    # my parsing functions...


(The exact path to the middleware will depend on your project structure.)
You can find my complete example here.
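As an alternative to custom_settings, the same middleware can be enabled project-wide in settings.py. A sketch, assuming the middleware lives in a middlewares.py module inside a project package named `myproject` (adjust the dotted path to your own layout):

```python
# settings.py (hypothetical "myproject" package layout)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomCloudflareMiddleware': 543,
}
```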
