Unable to get Scrapy stats from CrawlerProcess

ia2d9nvy  posted on 2022-11-09  in Other

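I'm running a Scrapy spider from another script, and I need to retrieve and save the stats from the Crawler into a variable. I've looked through the docs and other StackOverflow questions, but I haven't been able to solve this.
This is the script I run the crawl from: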

import scrapy
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(spiders.MySpider)
process.start()

stats = CrawlerProcess.stats.getstats() # I need something like this

I'd like stats to contain this data (from scrapy.statscollectors):

{'downloader/request_bytes': 44216,
     'downloader/request_count': 36,
     'downloader/request_method_count/GET': 36,
     'downloader/response_bytes': 1061929,
     'downloader/response_count': 36,
     'downloader/response_status_count/200': 36,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 11, 9, 16, 31, 2, 382546),
     'log_count/DEBUG': 37,
     'log_count/ERROR': 35,
     'log_count/INFO': 9,
     'memusage/max': 62623744,
     'memusage/startup': 62623744,
     'request_depth_max': 1,
     'response_received_count': 36,
     'scheduler/dequeued': 36,
     'scheduler/dequeued/memory': 36,
     'scheduler/enqueued': 36,
     'scheduler/enqueued/memory': 36,
     'start_time': datetime.datetime(2018, 11, 9, 16, 30, 38, 140469)}

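I've already checked CrawlerProcess: crawl() returns a Deferred, and the crawler is removed from its `crawlers` field once the crawl finishes.
Is there a way to solve this?
Best, Peter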

lf5gs5x2 #1

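According to the documentation, CrawlerProcess.crawl accepts either a crawler or a spider class, and you can create a crawler from a spider class with CrawlerProcess.create_crawler.
So you can create the Crawler instance yourself before starting the crawl, keep a reference to it, and read the attributes you need from it once the crawl has finished.
Here is an example, made by editing a few lines of your original code: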

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.crawler.stats.inc_value('foo')

process = CrawlerProcess({})
crawler = process.create_crawler(TestSpider)
process.crawl(crawler)
process.start()

stats_obj = crawler.stats
stats_dict = crawler.stats.get_stats()

# perform the actions you want with the stats object or dict

fsi0uk1n #2

If you want to get the stats in your script through signals, this may help:

from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess

def spider_results(spider):
    results = []
    stats = []

    def crawler_results(signal, sender, item, response, spider):
        # runs once for every scraped item
        results.append(item)

    def crawler_stats(*args, **kwargs):
        # runs when the spider is closed; the signal's sender is the Crawler
        stats.append(kwargs['sender'].stats.get_stats())

    dispatcher.connect(crawler_results, signal=signals.item_scraped)
    dispatcher.connect(crawler_stats, signal=signals.spider_closed)

    process = CrawlerProcess()
    process.crawl(spider)
    process.start()  # the script will block here until the crawling is finished
    return results, stats

Hope this helps!
