Scrapy: How to output items in a specific JSON format

nwlqm0z1  posted on 2022-11-09 in Other

I am outputting Scrapy data in JSON format. The default Scrapy exporter writes a list of dicts as JSON. The items look like this:

[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]

But I want to export the data in a specific format like the following:

{
"Shop Name":"Shop 1",
"Location":"XXXXXXXXX",
"Contact":"XXXX-XXXXX",
"Products":
[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]
}

Is there a way to do this? Any advice is appreciated. Thanks.

ix0qys7i1#

This is well documented on the Scrapy website.

from scrapy.exporters import JsonItemExporter

class ItemPipeline(object):

    file = None

    def open_spider(self, spider):
        # JsonItemExporter writes bytes, so the file must be opened in binary mode.
        self.file = open('item.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This creates a JSON file containing your items.
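
The exporter above still writes a bare list of items. To get the wrapped object asked for in the question, one option is a pipeline that collects the items in memory and writes a single JSON document when the spider closes. The following is a minimal sketch under that assumption, not something taken from the Scrapy docs: the class name `ShopJsonPipeline`, the `shop_info` values (the placeholders from the question), and the file name `shop.json` are all hypothetical.

```python
import json


class ShopJsonPipeline:
    """Collect every item, then write one wrapped JSON object at the end."""

    # Hypothetical shop metadata; the values are the question's placeholders.
    shop_info = {
        "Shop Name": "Shop 1",
        "Location": "XXXXXXXXX",
        "Contact": "XXXX-XXXXX",
    }
    output_path = "shop.json"  # assumed output file name

    def open_spider(self, spider):
        self.products = []

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.products.append(dict(item))
        return item

    def close_spider(self, spider):
        # Shop fields first, then the collected items under "Products".
        document = dict(self.shop_info, Products=self.products)
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(document, f, ensure_ascii=False, indent=4)
```

Like any pipeline, it would have to be enabled through `ITEM_PIPELINES` in the project settings. Because all items are held in memory until the spider closes, this approach suits small to medium crawls rather than very large ones.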

irtuqstp2#

I was trying to export pretty-printed JSON, and this worked for me.
I created a pipeline like this:

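import json

class JsonPipeline(object):

    def open_spider(self, spider):
        # Text mode: json.dumps returns str, not bytes.
        self.file = open('your_file_name.json', 'w', encoding='utf-8')
        self.file.write("[")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        )
        # Write the comma before every item except the first, so the
        # array has no trailing comma and stays valid JSON.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(line)
        return item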

It is similar to the example in the Scrapy docs at https://doc.scrapy.org/en/latest/topics/item-pipeline.html, except that it indents each JSON property and prints it on its own line.
See the section on pretty printing at https://docs.python.org/2/library/json.html
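
To see what those `json.dumps` arguments change, here is a small standalone comparison using only the standard library. The sample item is made up from the question's fields and is only for illustration.

```python
import json

# Illustrative item built from the fields shown in the question.
item = {"Product Name": "Product1", "Price": "20.5", "Currency": "USD"}

# Default call: everything on one line, keys in insertion order.
compact = json.dumps(item)

# Same arguments as the pipeline: sorted keys, 4-space indent,
# one key per line.
pretty = json.dumps(item, sort_keys=True, indent=4, separators=(",", ": "))

print(compact)
print(pretty)
```

Both strings parse back to the same dict; only the layout differs, with the pretty version listing the keys alphabetically, one per line.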

qvk1mo1f3#

Another possible solution is to write the spider's output to JSON directly from the command line.

scrapy crawl "name_of_your_spider" -a NAME_OF_ANY_ARGUMENT=VALUE_OF_THE_ARGUMENT -o output_data.json

mm5n2pyu4#

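Another way to get a JSON export of a Scrapy spider's scraped output is to enable feed exports, a built-in Scrapy feature that can be enabled or disabled as needed. You can do this by defining custom_settings on the spider, which overrides the project-wide Scrapy settings for that particular spider.
So, for any spider named "sample_spider":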

import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = []

    custom_settings = {
        'FEED_URI': 'sample_spider_crawled_data.json',
        'FEED_FORMAT': 'json',
        'FEED_EXPORTERS': {
            'json': 'scrapy.exporters.JsonItemExporter',
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
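
On newer Scrapy releases the same export can be expressed with the single `FEEDS` setting. The fragment below is a sketch that assumes Scrapy 2.1 or later, where `FEEDS` supersedes `FEED_URI` and `FEED_FORMAT`; the file name is reused from the example above, and the `indent` value is an illustrative choice.

```python
# settings.py fragment; the same dict could also be placed under the
# 'FEEDS' key of a spider's custom_settings.
# Assumes Scrapy 2.1+, where FEEDS replaces FEED_URI / FEED_FORMAT.
FEEDS = {
    "sample_spider_crawled_data.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,  # pretty-print the exported items
    },
}
```

This still produces a plain list of items, so the wrapped shop object from the question would need a custom pipeline or exporter on top of it.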
