Process a page that returns an HTTP 500 code with Python/Scrapy

Posted by 7xzttuei on 2022-11-09 in Python


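I am having trouble accessing some websites that return an HTTP 500 code along with a correctly formatted HTML page.
As a result, I can download the page with Chrome/Firefox, but I cannot download it with Scrapy.
Scrapy log: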

2020-04-10 15:57:16 [scrapy.core.engine] INFO: Spider opened
2020-04-10 15:57:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-10 15:57:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-10 15:57:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 1 times): 500 Internal Server Error
2020-04-10 15:57:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 2 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 3 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (referer: None)
2020-04-10 15:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html>: HTTP status code is not handled or not allowed

A screenshot showed the web server returning HTTP 500 along with the web page rendered correctly in Firefox.

The test page is https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html
Thanks, and please let me know if I need to add any details.

scrapy

Source: https://stackoverflow.com/questions/61143400/process-page-that-returns-http-500-code-with-python-scrapy

2 Answers


csga3l58 #1

If you only want to handle it in a single spider:

from scrapy import Spider

class MySpider(Spider):
    ...
    # Let responses with HTTP status 500 reach this spider's callbacks
    handle_httpstatus_list = [500]

If you only want to handle it for a single request:

from scrapy import Request
...
def my_parse_method(self, response):
    ...
    # Only this request lets a 500 response through to its callback
    yield Request(url='http://example.com', meta={'handle_httpstatus_list': [500]})

Posted 2022-11-09

xpcnnkqh #2

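By default, Scrapy ignores responses with a 500 status code and does not pass them to your callbacks. You can override this behavior by specifying a setting in the spider class.
Something like this: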

import scrapy

class YourSpider(scrapy.Spider):
    custom_settings = {
        # Let 500 responses through to the spider's callbacks
        'HTTPERROR_ALLOWED_CODES': [500]
    }

See the Scrapy documentation on HttpErrorMiddleware for more information.

Posted 2022-11-09
