Handling pages that return an HTTP 500 code with Python/Scrapy

Posted by 7xzttuei on 2022-11-09 in Python

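I am having trouble accessing some websites that return an HTTP 500 code along with a correctly formatted HTML page.
As a result, I can download the page with Chrome/Firefox, but not with Scrapy.
Scrapy log: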

2020-04-10 15:57:16 [scrapy.core.engine] INFO: Spider opened
2020-04-10 15:57:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-10 15:57:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-10 15:57:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 1 times): 500 Internal Server Error
2020-04-10 15:57:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 2 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 3 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (referer: None)
2020-04-10 15:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html>: HTTP status code is not handled or not allowed

See the screenshot below, which shows the web server returning HTTP 500 along with a web page that Firefox renders correctly.

The test page is https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html
Thanks, and let me know if I need to add any details.

csga3l58 #1

If you only want to handle it in a single spider:

from scrapy import Spider

class MySpider(Spider):
    ...
    # Let responses with status 500 reach this spider's callbacks
    handle_httpstatus_list = [500]

If you only want to handle it for a single request:

from scrapy import Request

...
def my_parse_method(self, response):
    ...
    # Allow a 500 response for this request only
    yield Request(url='http://example.com', meta={'handle_httpstatus_list': [500]})
xpcnnkqh #2

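By default, Scrapy ignores responses with a 500 status code and does not pass them to your callbacks. You can override this behavior by specifying the allowed codes in the spider class.
Something like this: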

from scrapy import Spider

class YourSpider(Spider):
    custom_settings = {
        # Let the HTTP error middleware pass 500 responses to callbacks
        'HTTPERROR_ALLOWED_CODES': [500]
    }

More information is available here.
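
The three options mentioned in this thread (the `HTTPERROR_ALLOWED_CODES` setting, the spider's `handle_httpstatus_list` attribute, and the `handle_httpstatus_list` key in the request `meta`) can overlap. As far as I understand, the most specific one wins: the request `meta` key takes precedence over the spider attribute, which in turn takes precedence over the setting. The function below is a simplified model of that decision and not Scrapy's actual code; `reaches_callback` and its parameter names are made up for illustration.

```python
def reaches_callback(status, meta=None, spider_list=None, allowed_codes=()):
    """Simplified model of the status filter; not Scrapy's real implementation.

    status        -- HTTP status code of the response
    meta          -- request meta dict (may hold 'handle_httpstatus_list')
    spider_list   -- the spider's handle_httpstatus_list, or None if unset
    allowed_codes -- value of the HTTPERROR_ALLOWED_CODES setting
    """
    # Successful (2xx) responses always reach the callback.
    if 200 <= status < 300:
        return True

    meta = meta or {}
    if "handle_httpstatus_list" in meta:
        # A per-request list overrides everything else.
        allowed = meta["handle_httpstatus_list"]
    elif spider_list is not None:
        # Otherwise the spider attribute is used if it is defined.
        allowed = spider_list
    else:
        # Finally fall back to the project or spider setting.
        allowed = allowed_codes

    return status in allowed


if __name__ == "__main__":
    print(reaches_callback(500))                      # default behavior
    print(reaches_callback(500, spider_list=[500]))   # spider attribute
    print(reaches_callback(500, allowed_codes=[500])) # setting
```

In this model a per-request list that does not contain 500 still blocks the response even if the spider attribute allows it, so it is probably simplest to pick one of the three mechanisms and use it consistently.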
