Scrapy: how do I check for broken links?

kkih6yb8 asked on 2022-11-09

I have an array of links and I need a way to check whether any of them are broken. Roughly, I need to implement something with the following structure:

def parse(self, response, **cb_kwargs):
    for link in links:
        # if response is HTTP 404 -> callback=self.parse_data ...
        # elif response is HTTP 200 -> callback=self.parse_product ...

def parse_data(self, response, **cb_kwargs):
    pass

def parse_product(self, response, **cb_kwargs):
    pass

The thing is, I need to know the status inside the first method (parse). Is that possible?

mwecs4sa 1#

You can add the links to start_urls and check response.status (and response.url) in parse(). You can run the code that processes that url right there; there is no need to send it again with a Request, and besides, Scrapy (by default) skips duplicate requests.
However, Scrapy normally drops error responses before they reach parse(), so you have to add those status codes to handle_httpstatus_list:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = [
        'http://httpbin.org/get',    # 200
        'http://httpbin.org/error',  # 404
        'http://httpbin.org/post',   # 405
    ]

    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('url:', response.url)
        print('status:', response.status)

        if response.status == 200:
            self.process_200(response)

        if response.status == 404:
            self.process_404(response)

        if response.status == 405:
            self.process_405(response)

    def process_200(self, response):
        print('Process 200:', response.url)

    def process_404(self, response):
        print('Process 404:', response.url)

    def process_405(self, response):
        print('Process 405:', response.url)

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    # 'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
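
If the links come from somewhere other than start_urls, the same idea also works when you yield the Requests yourself: keep the spider-level handle_httpstatus_list, or set it per request through the request's meta (Scrapy's HttpError middleware also honours a handle_httpstatus_list meta key). A minimal sketch, assuming a hypothetical list of links and a callback named check_link:

import scrapy

class LinkCheckSpider(scrapy.Spider):

    name = 'linkcheck'

    start_urls = ['http://httpbin.org/html']

    def parse(self, response, **cb_kwargs):
        # hypothetical list of links to verify
        links = [
            'http://httpbin.org/get',         # 200
            'http://httpbin.org/status/404',  # 404
        ]
        for link in links:
            # per-request alternative to the spider-level handle_httpstatus_list:
            # let 404 responses reach the callback instead of being dropped
            yield scrapy.Request(
                link,
                callback=self.check_link,
                meta={'handle_httpstatus_list': [404]},
            )

    def check_link(self, response):
        if response.status == 200:
            self.logger.info('OK: %s', response.url)
        else:
            self.logger.warning('Broken (%s): %s', response.status, response.url)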

EDIT:

I have not tested it, but in the documentation you can also see Using errbacks to catch exceptions in request processing, which shows how to use errback=function so that the failure is passed to that function when the request fails.

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

See also Accessing additional data in errback functions in the documentation.
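
That documentation section reads the callback keyword arguments back from the failed request via failure.request.cb_kwargs. A minimal sketch of the idea (the main_url key and the spider and method names here are only illustrative):

import scrapy

class CbKwargsErrbackSpider(scrapy.Spider):

    name = 'cb_kwargs_errback'

    def start_requests(self):
        yield scrapy.Request(
            'http://www.httpbin.org/status/500',
            callback=self.parse_page,
            errback=self.handle_error,
            cb_kwargs={'main_url': 'http://www.httpbin.org/'},
        )

    def parse_page(self, response, main_url):
        self.logger.info('Got %s (linked from %s)', response.url, main_url)

    def handle_error(self, failure):
        # cb_kwargs are still available on the original request
        main_url = failure.request.cb_kwargs['main_url']
        self.logger.warning('Failed request linked from %s', main_url)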
