How to check response status for HTTP error codes using Scrapy?

Posted by pb3s4cty on 2022-11-09 in Other


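I want to check the response status with Scrapy and export it to a CSV file. I tried response.status, but only "200" ever shows up in the exported CSV. How can I get the other status codes, such as "404" or "502"?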

def parse(self, response):
    yield {
        'URL': response.url,
        'Status': response.status
    }

scrapy

Source: https://stackoverflow.com/questions/74144609/how-to-check-response-status-for-http-error-codes-using-scrapy

2 Answers


h7appiyu #1

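By default Scrapy filters out unsuccessful responses before they reach your callback, which is why you only see 200. In your project settings you can adjust the following options so that certain error codes are no longer filtered automatically.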

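HTTPERROR_ALLOWED_CODES

Default: []
Pass all responses with non-200 status codes contained in this list.

HTTPERROR_ALLOW_ALL

Default: False
Pass all responses, regardless of their status code.

settings.py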

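# Either let every response through, whatever its status code:
HTTPERROR_ALLOW_ALL = True

# or allow only specific error codes (extend the list as needed):
HTTPERROR_ALLOWED_CODES = [500, 501, 404]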
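To make the effect of these two settings concrete, here is a rough pure-Python model of the filtering decision. It is only a sketch: `reaches_callback` is a hypothetical helper written for illustration, not part of Scrapy, and it ignores the per-request and per-spider overrides that Scrapy also supports. It assumes that successful responses (status 200 to 299) always pass and that anything else passes only if one of the two settings allows it.

```python
def reaches_callback(status, allowed_codes=(), allow_all=False):
    """Hypothetical helper: would a response with this status reach parse()?

    allowed_codes stands in for HTTPERROR_ALLOWED_CODES,
    allow_all stands in for HTTPERROR_ALLOW_ALL.
    """
    # Successful (2xx) responses are always passed through.
    if 200 <= status < 300:
        return True
    # With "allow all" enabled, nothing is filtered.
    if allow_all:
        return True
    # Otherwise only explicitly listed codes get through.
    return status in allowed_codes


# With default settings only the 200 response would be seen by parse().
for status in (200, 404, 502):
    print(status, reaches_callback(status))

# Allowing 404 explicitly lets it through, while 502 is still dropped.
for status in (200, 404, 502):
    print(status, reaches_callback(status, allowed_codes=[404]))
```

Under this model the default settings explain the behaviour in the question: the 404 and 502 responses never reach parse(), so only 200 ends up in the CSV.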

Answered 2022-11-09

ctzwtxfj #2

You can add an errback to the request, then catch the HTTP error in the errback function and yield the information you need.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request(url="https://example.com/error", errback=self.parse_error)

    def parse_error(self, failure):
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            yield {
                'URL': response.url,
                'Status': response.status
            }

    def parse(self, response):
        yield {
            'URL': response.url,
            'Status': response.status
        }

Answered 2022-11-09
