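I'm having trouble accessing some websites that return an HTTP 500 status code along with a correctly formatted HTML page.
As a result, I can download the page with Chrome/Firefox, but not with Scrapy.
Scrapy log: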
2020-04-10 15:57:16 [scrapy.core.engine] INFO: Spider opened
2020-04-10 15:57:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-10 15:57:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-10 15:57:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 1 times): 500 Internal Server Error
2020-04-10 15:57:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 2 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 3 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (referer: None)
2020-04-10 15:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html>: HTTP status code is not handled or not allowed
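See the screenshot below, which shows the web server returning HTTP 500 along with a web page that renders correctly in Firefox.
The test page is https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html
Thanks, and let me know if I need to add any details.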
2 Answers
csga3l58 #1
If you only want to handle this in a single spider:
If you only want to handle it for a single request:
xpcnnkqh #2
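By default, Scrapy's HttpErrorMiddleware filters out responses with non-2xx status codes such as 500, so the spider never gets to process them. You can override this by specifying the status code in your spider class. Something like this: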
More information is available here.