Downloader middleware to retry failed requests in Scrapy

Asked by qyswt5oh on 2022-11-09

In Scrapy, I am trying to write a downloader middleware that filters out responses with status 401, 403 or 410 and sends new requests for those URLs. I get an error saying that process_response must return a Response or a Request, because I yield 10 requests to make sure each failed URL is retried enough times. How can I fix this? Thanks.
Here is the middleware code, which I have enabled in settings.py:
'''

from scrapy import Spider, Request
from scrapy.exceptions import IgnoreRequest

# FailedRequestsItem and XxxSpider are defined elsewhere in my project.


class NegativeResponsesDownloaderMiddlerware(Spider):

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        print("---(NegativeResponsesDownloaderMiddlerware)")
        filtered_status_list = [401, 403, 410]
        adaptoz = FailedRequestsItem()
        if response.status in filtered_status_list:
            adaptoz['error_code'][response.url] = response.status
            print("---(process_response) => Sending URL back to DOWNLOADER: URL =>", response.url)

            for i in range(self.settings.get('ERROR_HANDLING_ATTACK_RATE')):
                yield Request(response.url, self.check_retrial_result, headers=self.headers)

            raise IgnoreRequest(
                f"URL taken out from first flow. Error code: {adaptoz['error_code']} => URL = {response.url}"
            )
        else:
            return response

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

    def check_retrial_result(self, response):
        if response.status == 200:
            x = XxxSpider()
            x.parse_event(response)
        else:
            return None
'''
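For context, this is a rough sketch of how such a middleware is typically enabled in settings.py; the module path myproject.middlewares, the priority value 543 and the ERROR_HANDLING_ATTACK_RATE value are placeholders, not values taken from the original post:

'''
# settings.py -- minimal sketch; module path and priority are assumptions
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.NegativeResponsesDownloaderMiddlerware": 543,
}

# Custom setting read by the middleware (value assumed here)
ERROR_HANDLING_ATTACK_RATE = 10
'''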

nnsrf1az (answer 1)

Unfortunately, Scrapy does not know what to do when you turn the return value of a middleware method into a generator, so you cannot use yield in any of the middleware interface methods.
What you can do instead is create the sequence of requests and feed them back into the Scrapy engine, so that they are parsed by the spider just as if they had come from start_urls or the start_requests method.
You can do this by passing each created request to the spider.crawler.engine.crawl method (if it passes the filter test) and raising IgnoreRequest once the loop has finished.

import scrapy
from scrapy.exceptions import IgnoreRequest

def process_response(self, request, response, spider):
    filtered_status_list = [401, 403, 410]
    adaptoz = FailedRequestsItem()  # item class from the asker's project
    if response.status in filtered_status_list:
        adaptoz['error_code'][response.url] = response.status
        for i in range(self.settings.get('ERROR_HANDLING_ATTACK_RATE')):
            # callback_method is whatever callback should parse the retried response;
            # dont_filter=True keeps the duplicate filter from dropping the repeated URL
            retry_request = scrapy.Request(response.url, callback=callback_method,
                                           headers=self.headers, dont_filter=True)
            spider.crawler.engine.crawl(retry_request, spider)
        raise IgnoreRequest(
            f"URL taken out from first flow. Error code: {adaptoz['error_code']} => URL = {response.url}"
        )
    return response
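The snippet above assumes the middleware already has self.settings and self.headers available. A minimal sketch of wiring that up through Scrapy's standard from_crawler hook (the header dict is a placeholder, and ERROR_HANDLING_ATTACK_RATE is the custom setting from the question):

import scrapy


class NegativeResponsesDownloaderMiddlerware:

    def __init__(self, settings):
        # Keep a reference to the project settings so ERROR_HANDLING_ATTACK_RATE
        # can be read inside process_response.
        self.settings = settings
        self.headers = {}  # placeholder; supply the headers used by the original requests

    @classmethod
    def from_crawler(cls, crawler):
        # Standard Scrapy hook: gives the middleware access to the crawler's settings
        # (crawler.engine is reachable via spider.crawler in process_response).
        return cls(crawler.settings)
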
lf3rwulv (answer 2)

If I understand you correctly, what you are trying to achieve can be done with settings alone:

RETRY_TIMES = 10  # Default is 2
RETRY_HTTP_CODES = [401, 403, 410]  # Default: [500, 502, 503, 504, 522, 524, 408, 429]
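
These two settings are consumed by Scrapy's built-in RetryMiddleware. If the behaviour should only apply to one spider, they can also be set per spider via custom_settings; a small sketch, reusing the XxxSpider name from the question (the spider name and parse body are placeholders):

import scrapy


class XxxSpider(scrapy.Spider):
    name = "xxx"  # placeholder name
    # Per-spider overrides picked up by the built-in RetryMiddleware.
    custom_settings = {
        "RETRY_TIMES": 10,
        "RETRY_HTTP_CODES": [401, 403, 410],
    }

    def parse(self, response):
        # parsing logic goes here (parse_event in the original post)
        ...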

The documentation is here (see the RetryMiddleware section of the Scrapy downloader-middleware docs).
