Overriding Scrapy logging, especially from middlewares

bn31dyow asked on 2022-12-18

I worked on a project using Scrapy in which I have my own JSON logging format.
I want to avoid any multi-line stack traces from Scrapy, especially from the robots.txt middleware. I want either a proper one-line error, or the whole stack trace bundled into a single message.
How can I disable or override this logging behaviour? Below is an example stack trace I get from the robots.txt downloader middleware:

    2017-10-03 19:08:57 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.example.com/robots.txt>: DNS lookup failed: no results for hostname lookup: www.example.com.
    Traceback (most recent call last):
      File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
        "no results for hostname lookup: {}".format(self._hostStr)
    twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.example.com.

iyzzxitl answered (#1):

I'm not sure why you dislike multi-line error messages (they are the traceback print of an exception). In any case, we can customize the format of Scrapy's logs. Suppose you run your scraping script through the scrapy command line, e.g. scrapy crawl or scrapy runspider. Below is sample code (Python 3) showing how to use your own formatter:

    import logging

    import scrapy


    class OneLineFormatter(logging.Formatter):
        """Collapse every log record, including tracebacks, into one line."""

        def format(self, record):
            # Let the base class build the full message (traceback included),
            # then replace the newlines so everything fits on a single line.
            formatted = super(OneLineFormatter, self).format(record)
            return formatted.replace('\n', ' ')


    class TestSpider(scrapy.Spider):
        name = "test"
        start_urls = [
            'http://www.somenxdomain.com/robots.txt',
        ]

        def __init__(self, fmt, datefmt, *args, **kwargs):
            # Build the custom formatter from the project's format settings
            # and install it on every handler of the root logger.
            my_formatter = OneLineFormatter(fmt=fmt, datefmt=datefmt)
            root = logging.getLogger()
            for h in root.handlers:
                h.setFormatter(my_formatter)
            super(TestSpider, self).__init__(*args, **kwargs)

        @classmethod
        def from_crawler(cls, crawler):
            # Pass the LOG_FORMAT and LOG_DATEFORMAT settings to __init__.
            settings = crawler.settings
            return cls(settings.get('LOG_FORMAT'), settings.get('LOG_DATEFORMAT'))

        def parse(self, response):
            pass
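
For example, assuming you save the spider above as test_spider.py (an illustrative filename), you can run it directly with:

    scrapy runspider test_spider.py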

Here is some explanation.

  1. Python logging workflow. Scrapy itself uses Python's built-in logging system, so you need some basic knowledge of Python logging, especially of the relationships among the Logger, Handler, Filter and Formatter classes. I strongly recommend the [Python logging flow](https://docs.python.org/2/howto/logging.html#logging-flow); a minimal sketch of these relationships follows this list.
  2. Scrapy logging and settings. If your spider is run through the scrapy command line, e.g. scrapy crawl or scrapy runspider, then Scrapy's function configure_logging is called to initialize logging. The [Scrapy logging documentation](https://doc.scrapy.org/en/latest/topics/logging.html) gives some directions on how to customize logging, and through the Scrapy settings you can access your format settings.
  3. How the sample code works. The basic workflow is:
  • First, define your own formatter class to customize the log format.
  • Second, in your spider, access the format settings to initialize your formatter.
  • Finally, in your spider, get the root logger and set your formatter on all of root's handlers.
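
As a quick illustration of those Logger/Handler/Formatter relationships, here is a minimal self-contained sketch, independent of Scrapy (the logger name 'demo' is arbitrary):

    import logging

    # A Logger hands records to its Handlers; each Handler uses a Formatter
    # to turn a record into the final text.
    logger = logging.getLogger('demo')
    handler = logging.StreamHandler()  # where the formatted records go
    handler.setFormatter(logging.Formatter(
        '%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info('hello')  # -> "2017-10-03 ... [demo] INFO: hello"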

If you write your own script and use Scrapy as an API, see [Run Scrapy from a script](https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script); in that case you need to configure logging yourself, as sketched below.
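
For instance, here is a minimal sketch of that scripted setup, assuming the OneLineFormatter and TestSpider classes defined above and a standard project layout (CrawlerProcess calls configure_logging for you when it is created):

    import logging

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    process = CrawlerProcess(settings)  # sets up Scrapy's logging internally

    # Install the one-line formatter on every handler of the root logger.
    formatter = OneLineFormatter(fmt=settings.get('LOG_FORMAT'),
                                 datefmt=settings.get('LOG_DATEFORMAT'))
    for handler in logging.getLogger().handlers:
        handler.setFormatter(formatter)

    process.crawl(TestSpider)
    process.start()  # blocks until the crawl finishes
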
Note that the formatter above does not take effect until the spider is initialized. Here is some printed output:

    2017-10-03 11:59:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
    2017-10-03 11:59:39 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',  'scrapy.downloadermiddlewares.retry.RetryMiddleware',  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',  'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',  'scrapy.spidermiddlewares.referer.RefererMiddleware',  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',  'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled item pipelines: []
    2017-10-03 11:59:39 [scrapy.core.engine] INFO: Spider opened
    2017-10-03 11:59:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-03 11:59:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.somenxdomain.com/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.somenxdomain.com/robots.txt> Traceback (most recent call last):   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks     result = result.throwExceptionIntoGenerator(g)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator     return g.throw(self.type, self.value, self.tb)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request     defer.returnValue((yield download_func(request=request,spider=spider)))   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks     current.result = callback(current.result, *args, **kw)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts     "no results for hostname lookup: {}".format(self._hostStr) twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:40 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-10-03 11:59:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/exception_count': 3,  'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,  'downloader/request_bytes': 684,  'downloader/request_count': 3,  'downloader/request_method_count/GET': 3,  'finish_reason': 'finished',  'finish_time': datetime.datetime(2017, 10, 3, 15, 59, 40, 46636),  'log_count/DEBUG': 4,  'log_count/ERROR': 1,  'log_count/INFO': 7,  'scheduler/dequeued': 3,  'scheduler/dequeued/memory': 3,  'scheduler/enqueued': 3,  'scheduler/enqueued/memory': 3,  'start_time': datetime.datetime(2017, 10, 3, 15, 59, 39, 793795)}
    2017-10-03 11:59:40 [scrapy.core.engine] INFO: Spider closed (finished)

You can see that, once the spider is running, all messages are formatted onto a single line (by removing '\n').
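
Since the question mentions a custom JSON log format, the same hook can carry a JSON-emitting formatter instead of OneLineFormatter. A minimal sketch (the class and field names here are illustrative, not Scrapy's):

    import json
    import logging

    class JsonLineFormatter(logging.Formatter):
        """Emit each record as a single-line JSON object, with any
        traceback bundled into the message string."""

        def format(self, record):
            message = record.getMessage()
            if record.exc_info:
                # Fold the whole traceback into the message as one string.
                exc_text = self.formatException(record.exc_info)
                message += ' ' + exc_text.replace('\n', ' ')
            return json.dumps({
                'time': self.formatTime(record, self.datefmt),
                'logger': record.name,
                'level': record.levelname,
                'message': message,
            })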
