Scrapy - output to multiple JSON files

Posted by arknldoa on 2022-11-09 in Other

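I am pretty new to Scrapy. I am looking into using it to crawl an entire website for links, outputting the items into multiple JSON files along the way so that I can then upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of ending up with one giant file? From what I have read, the Item Exporters can only output to one file per spider, but I am only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items in each file, like 500 or 1000.
Here is the code I have set up so far (based on the Dmoz.org example used in the tutorial):
dmoz_spider.py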

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for your help.

scrapy

Source: https://stackoverflow.com/questions/32870506/scrapy-output-to-multiple-json-files

2 Answers

Answer 1 (i7uaboj4)

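I don't think the built-in feed exporters support writing into multiple files.
One option is to export into a single file in jsonlines format, which is basically one JSON object per line and is convenient to pipe and split.
Then, separately, after the crawl is done, you can read the file in chunks of the desired size and write each chunk into a separate JSON file.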
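The post-processing step described above could look roughly like the following. This is a minimal sketch using only the standard library; the function name `split_jsonlines`, the `items-0001.json` naming scheme and the default chunk size of 500 are assumptions made for illustration, not part of the original answer.

```python
import json
from itertools import islice
from pathlib import Path


def split_jsonlines(source, out_dir, chunk_size=500, prefix="items"):
    """Split a JSON Lines file into JSON array files of at most chunk_size items."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, encoding="utf-8") as lines:
        # Parse lazily and skip blank lines so a trailing newline is harmless.
        records = (json.loads(line) for line in lines if line.strip())
        index = 1
        while True:
            # Pull the next chunk_size records from the generator.
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            target = out_dir / f"{prefix}-{index:04d}.json"
            with open(target, "w", encoding="utf-8") as out:
                json.dump(chunk, out, ensure_ascii=False)
            written.append(target)
            index += 1
    return written
```

Assuming the crawl was exported with something like `scrapy crawl dmoz -o items.jl`, calling `split_jsonlines("items.jl", "chunks", chunk_size=500)` would write `chunks/items-0001.json`, `chunks/items-0002.json` and so on, each holding a JSON array of up to 500 items.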
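"So I could then upload them to Amazon Cloud Search for indexing."
Note that Scrapy's feed exports can write directly to Amazon S3 (not sure whether that helps, just FYI).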
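For reference, an S3 feed might be configured along these lines in `settings.py`. This is a hedged sketch: the bucket name and key prefix are hypothetical, S3 storage needs `botocore` installed plus AWS credentials, and the `FEEDS` setting with the `batch_item_count` option and the `%(batch_id)d` placeholder exists, to my knowledge, only in newer Scrapy releases (2.3 and later), so check the documentation for the version you run. Where that option is available, it also covers the per-file item limit asked about in the question.

```python
# settings.py (sketch; bucket name and key prefix are hypothetical)
FEEDS = {
    # %(batch_id)d is replaced by the sequence number of each batch,
    # so every batch of items ends up in its own file.
    "s3://my-example-bucket/dmoz/items-%(batch_id)d.json": {
        "format": "json",
        # Start a new file after this many items (assumed limit of 500).
        "batch_item_count": 500,
    },
}
```

On older versions without batch support, exporting one jsonlines file and splitting it afterwards remains the simpler route.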

Answer 2 (umuewwlo)

You can add a name to each item class and use a custom pipeline to write the items to different JSON files, like this:

from scrapy.exporters import JsonItemExporter
from scrapy import signals

class MultiOutputExporter(object):

    @classmethod
    def from_crawler(cls, crawler):

        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):

        self.items = ['item1','item2']
        self.files = {}
        self.exporters = {}

        for item in self.items:

            self.files[item] = open(f'{item}.json', 'w+b')
            self.exporters[item] = JsonItemExporter(self.files[item])
            self.exporters[item].start_exporting()

    def spider_closed(self, spider):

        for item in self.items:
            self.exporters[item].finish_exporting()
            self.files[item].close()

    def process_item(self, item, spider):
        # Route the item to the exporter that matches its name.
        self.exporters[item.name].export_item(item)
        return item

Then add a name to each item class as follows:

class Item(scrapy.Item):

   name = 'item1'

Now enable the pipeline in the Scrapy settings (ITEM_PIPELINES) and voilà.
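The same custom-pipeline idea can also be bent toward the per-file item limit from the question. The following is a minimal sketch, not the answerer's code: the class name `ChunkedJsonPipeline`, the `chunk_size` and `output_dir` attributes and the file naming are all assumptions. It buffers items and writes a new JSON file each time the buffer reaches the limit.

```python
import json
import os


class ChunkedJsonPipeline:
    """Write items to numbered JSON files, starting a new file every chunk_size items."""

    chunk_size = 500   # assumed limit; the question mentions 500 or 1000
    output_dir = "."   # assumed output location

    def open_spider(self, spider):
        self.buffer = []
        self.file_index = 0

    def process_item(self, item, spider):
        # dict() works for scrapy.Item instances as well as plain dicts.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.chunk_size:
            self._flush()
        return item

    def close_spider(self, spider):
        # Write whatever is left over when the crawl ends.
        if self.buffer:
            self._flush()

    def _flush(self):
        self.file_index += 1
        path = os.path.join(self.output_dir, f"items-{self.file_index:04d}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(self.buffer, out, ensure_ascii=False)
        self.buffer = []
```

Either pipeline is enabled by listing it in `ITEM_PIPELINES`, for example `ITEM_PIPELINES = {'tutorial.pipelines.ChunkedJsonPipeline': 300}`, where the module path `tutorial.pipelines` is a guess based on the project name in the question. Buffering in memory keeps the sketch short; with very large items a smaller `chunk_size` may be preferable.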
