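I have Python code that fetches JSON documents from a local MongoDB, extracts the nested arrays, and replaces them with IDs. It then exports the three resulting data sets to CSV files. It works fine, but one thing is missing: I need to add a post_id column to the generated tags and comments files so that I can match each comment and tag to the post it came from.
Currently, my code looks like this: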
    import json
    import csv
    from pymongo import MongoClient
    from bson import ObjectId, json_util

    # Function to extract arrays and replace them with ID
    def process_json(json_data):
        result = {}
        arrays = {}
        item_id = json_data["_id"]
        if isinstance(item_id, dict):
            item_id = item_id["$oid"]
        if isinstance(item_id, ObjectId):
            item_id = str(item_id)
        result[item_id] = {}
        for key, value in json_data.items():
            if key != "_id":
                if isinstance(value, list):
                    array_id = f"{key}_id"
                    arrays[array_id] = value
                    result[item_id][array_id] = item_id
                else:
                    result[item_id][key] = value
        if "tags_id" in result[item_id]:
            post_id = result[item_id].get("tags_id")
            result[item_id]["post_id"] = post_id
        return result, arrays

    # Connecting to MongoDB
    client = MongoClient("localhost", 27017)
    db = client.mongotest
    collection = db.collectiontest

    # Selecting n JSON documents from collection
    n = 2
    documents = collection.find().limit(n)
    json_documents = [json.loads(json_util.dumps(document, default=json_util.default)) for document in documents]

    # Appending all JSONs into one
    output = {}
    extracted_arrays = {}
    for doc in json_documents:
        processed_doc, arrays = process_json(doc)
        item_id = list(processed_doc.keys())[0]
        output[item_id] = processed_doc[item_id]
        for array_id, array_values in arrays.items():
            if array_id in extracted_arrays:
                extracted_arrays[array_id].extend(array_values)
            else:
                extracted_arrays[array_id] = array_values

    # Creating main output CSV file
    with open("output.csv", 'w', newline='') as csvfile:
        fieldnames_output = list(output[list(output.keys())[0]].keys())
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames_output)
        writer.writeheader()
        writer.writerows(output.values())

    # Creating tags CSV file
    with open("tags.csv", 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['tags'])
        writer.writerows([[tag] for tag in extracted_arrays['tags_id']])

    # Creating comments CSV file
    with open("comments.csv", 'w', newline='') as csvfile:
        fieldnames_com = extracted_arrays['comments_id'][0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames_com)
        writer.writeheader()
        writer.writerows(extracted_arrays['comments_id'])

    # Printing output and extracted arrays for easier visual control
    print("Extracted arrays:")
    for array_id, array_values in extracted_arrays.items():
        print(array_id, ":", array_values)
    print("\nOutput:")
    print(json.dumps(output, indent=2))
Example of the JSON the code fetches from MongoDB:
    [{
      "_id": {
        "$oid": "50ab0f8bbcf1bfe2536dc3f9"
      },
      "body": "Amendment",
      "permalink": "aRjNnLZkJkTyspAIoRGe",
      "author": "machine",
      "title": "Bill of Rights",
      "tags": [
        "watchmaker",
        "santa",
        "xylophone",
        "math",
        "handsaw",
        "dream",
        "undershirt",
        "dolphin",
        "tanker",
        "action"
      ],
      "comments": [
        {
          "body": "Lorem ",
          "email": "HvizfYVx@pKvLaagH.com",
          "author": "Santiago Dollins"
        },
        {
          "body": "Lorem",
          "email": "WpOUCpdD@hccdxJvT.com",
          "author": "Jaclyn Morado"
        },
        {
          "body": "Loremid est laborum",
          "email": "OgDzHfFN@cWsDtCtx.com",
          "author": "Houston Valenti"
        }
      ],
      "date": {
        "$date": "2012-11-20T05:05:15.231Z"
      }
    }]
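I'm not sure whether post_id should be added to all the comments and tags inside the process_json function or at the stage where the CSVs are written, or how to do it in general. Any advice would be appreciated!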
1 Answer
tyu7yeag #1
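Since you haven't included an example of the expected output CSVs for your sample JSON, I don't really know what you want.
I've interpreted your question and code to mean that you want one CSV each for posts, tags, and comments, where every tag row and comment row carries the ID of the post it came from.
If that is close to what you want, read on.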
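I believe you just need to iterate over the comments and tags lists and, for each item you find there, associate the document's ID with it and save it as a new "row" in its own structure. So I made separate structures to hold the final rows.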
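A minimal sketch of such structures, assuming three plain lists of dicts. The names post_rows, tag_rows, and comment_rows are hypothetical, as are the "id" and "tag" column names:

```python
# Hypothetical names: one list of row dicts per output CSV.
post_rows = []     # one row per post
tag_rows = []      # one row per (post, tag) pair
comment_rows = []  # one row per comment

# Each tag or comment row carries the ID of its post in the first column.
post_id = "50ab0f8bbcf1bfe2536dc3f9"
tag_rows.append({"id": post_id, "tag": "watchmaker"})
comment_rows.append({"id": post_id, "body": "Lorem", "author": "Jaclyn Morado"})
```

Keeping one flat list per CSV means each list can later be handed to a csv.DictWriter as-is.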
I simplified your process_json() function to target the keys and objects/dicts I see in the sample JSON.
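One way such a simplified function could look is sketched below. It is an illustration under assumptions, not necessarily the exact code from this answer: the list names, the "id" and "tag" column names, and the unwrapping of "$date" values are all choices made for the example, and the sample document is a shortened version of the one in the question.

```python
post_rows, tag_rows, comment_rows = [], [], []  # hypothetical names

def process_json(doc):
    """Split one post document into a post row plus tag and comment rows."""
    id_ = doc["_id"]
    # json_util.dumps renders an ObjectId as {"$oid": "..."}; unwrap it,
    # otherwise fall back to str() (which also works for a raw ObjectId).
    id_ = id_["$oid"] if isinstance(id_, dict) else str(id_)

    post_row = {"id": id_}
    for key, value in doc.items():
        if key in ("_id", "tags", "comments"):
            continue  # these keys are handled separately
        if isinstance(value, dict) and "$date" in value:
            value = value["$date"]  # assumption: store the date as a plain string
        post_row[key] = value
    post_rows.append(post_row)

    # Every tag and comment becomes its own row, keyed by the post's ID.
    for tag in doc.get("tags", []):
        tag_rows.append({"id": id_, "tag": tag})
    for comment_dict in doc.get("comments", []):
        comment_rows.append({"id": id_, **comment_dict})

# Shortened version of the sample document from the question.
sample = {
    "_id": {"$oid": "50ab0f8bbcf1bfe2536dc3f9"},
    "title": "Bill of Rights",
    "tags": ["watchmaker", "santa"],
    "comments": [{"body": "Lorem", "email": "WpOUCpdD@hccdxJvT.com", "author": "Jaclyn Morado"}],
    "date": {"$date": "2012-11-20T05:05:15.231Z"},
}
process_json(sample)
```

If you prefer the column to be called post_id in tags.csv and comments.csv, use that key instead of "id" in the two loops at the end.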
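A few things to note about this code:
I build each new dict in the form {"id": id_, ...} so that the id key comes first.
{"id": id_, **comment_dict} means: create a new dict with the key "id", then follow it with all the key-value pairs from comment_dict.
Running it on the sample JSON creates the final *_rows lists.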
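To make the two notes above concrete, here is a small standalone sketch of the dict-unpacking form and the shape of the resulting rows. The "tag" column name is an assumption for the example; the values are taken from the sample JSON:

```python
id_ = "50ab0f8bbcf1bfe2536dc3f9"
comment_dict = {"body": "Lorem", "email": "WpOUCpdD@hccdxJvT.com", "author": "Jaclyn Morado"}

# "id" is inserted first, then every key-value pair of comment_dict follows.
comment_row = {"id": id_, **comment_dict}
tag_row = {"id": id_, "tag": "watchmaker"}

print(list(comment_row))  # ['id', 'body', 'email', 'author']
print(tag_row)            # {'id': '50ab0f8bbcf1bfe2536dc3f9', 'tag': 'watchmaker'}
```

Because dicts preserve insertion order, the id column ends up first when the row's keys are later used as CSV field names.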
From there, creating the CSVs just requires creating DictWriters (like you did), though you can give each DictWriter the first row of its list to set the field names.
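A sketch of that last step, assuming row lists shaped like the ones described above. The helper name write_rows, the sample rows, and the use of a temporary directory are all choices made for this example:

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write a list of dicts to CSV, taking the header from the first row."""
    if not rows:
        return  # nothing to write; also avoids an IndexError on rows[0]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical rows in the shape described above.
tag_rows = [
    {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "watchmaker"},
    {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "santa"},
]
comment_rows = [
    {"id": "50ab0f8bbcf1bfe2536dc3f9", "body": "Lorem",
     "email": "WpOUCpdD@hccdxJvT.com", "author": "Jaclyn Morado"},
]

out_dir = tempfile.mkdtemp()  # write to a scratch directory for the demo
tags_path = os.path.join(out_dir, "tags.csv")
comments_path = os.path.join(out_dir, "comments.csv")
write_rows(tags_path, tag_rows)
write_rows(comments_path, comment_rows)
```

Taking the field names from the first row only works if every row in a list has the same keys. By default DictWriter raises ValueError for a row containing a key that is not in fieldnames, so if your comments vary in shape you would need to collect the union of keys first.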