csv 获取从JSON文档中提取的嵌套数组的ID

von4xj4u  于 2023-07-31  发布在  其他
关注(0)|答案(1)|浏览(108)

我有一个Python代码,它从本地MongoDB获取JSON文档,提取嵌套数组并将其替换为id。然后,它将生成的三组数据导出到CSV文件。它工作得很好,但它缺少一件事:我需要在生成的标签和评论文件中添加一个post_id列,这样我就可以将评论和标签对应到它们所来自的帖子。
目前,我的代码看起来像这样:

import json
import csv
from pymongo import MongoClient
from bson import ObjectId, json_util

# Function to extract arrays and replace them with ID
def process_json(json_data):
    result = {}
    arrays = {}

    item_id = json_data["_id"]

    if isinstance(item_id, dict):
        item_id = item_id["$oid"]

    if isinstance(item_id, ObjectId):
        item_id = str(item_id)

    result[item_id] = {}

    for key, value in json_data.items():
        if key != "_id":
            if isinstance(value, list):
                array_id = f"{key}_id"
                arrays[array_id] = value
                result[item_id][array_id] = item_id
            else:
                result[item_id][key] = value

    if "tags_id" in result[item_id]:
        post_id = result[item_id].get("tags_id")
        result[item_id]["post_id"] = post_id

    return result, arrays

# Connecting to MongoDB
client = MongoClient("localhost", 27017)
db = client.mongotest
collection = db.collectiontest

# Selecting n JSON documents from collection
n = 2
documents = collection.find().limit(n)
json_documents = [json.loads(json_util.dumps(document, default=json_util.default)) for document in documents]

# Appending all JSONs into one
output = {}
extracted_arrays = {}

for doc in json_documents:
    processed_doc, arrays = process_json(doc)
    item_id = list(processed_doc.keys())[0]
    output[item_id] = processed_doc[item_id]

    for array_id, array_values in arrays.items():
        if array_id in extracted_arrays:
            extracted_arrays[array_id].extend(array_values)
        else:
            extracted_arrays[array_id] = array_values

# Creating main output CSV file
with open("output.csv", 'w', newline='') as csvfile:
    fieldnames_output = list(output[list(output.keys())[0]].keys())
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames_output)
    writer.writeheader()
    writer.writerows(output.values())

# Creating tags CSV file
with open("tags.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['tags'])
    writer.writerows([[tag] for tag in extracted_arrays['tags_id']])

# Creating comments CSV file
with open("comments.csv", 'w', newline='') as csvfile:
    fieldnames_com = extracted_arrays['comments_id'][0].keys()
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames_com)
    writer.writeheader()
    writer.writerows(extracted_arrays['comments_id'])

# Printing output and extracted arrays for easier visual control
print("Extracted arrays:")
for array_id, array_values in extracted_arrays.items():
    print(array_id, ":", array_values)

print("\nOutput:")
print(json.dumps(output, indent=2))

字符串
代码从MongoDB获取的JSON示例:

[{
  "_id": {
    "$oid": "50ab0f8bbcf1bfe2536dc3f9"
  },
  "body": "Amendment",
  "permalink": "aRjNnLZkJkTyspAIoRGe",
  "author": "machine",
  "title": "Bill of Rights",
  "tags": [
    "watchmaker",
    "santa",
    "xylophone",
    "math",
    "handsaw",
    "dream",
    "undershirt",
    "dolphin",
    "tanker",
    "action"
  ],
  "comments": [
    {
      "body": "Lorem ",
      "email": "HvizfYVx@pKvLaagH.com",
      "author": "Santiago Dollins"
    },
    {
      "body": "Lorem",
      "email": "WpOUCpdD@hccdxJvT.com",
      "author": "Jaclyn Morado"
    },
    {
      "body": "Loremid est laborum",
      "email": "OgDzHfFN@cWsDtCtx.com",
      "author": "Houston Valenti"
    }
  ],
  "date": {
    "$date": "2012-11-20T05:05:15.231Z"
  }
}]


我不确定是在创建process_json函数的阶段还是在写入CSV的阶段将post_id添加到所有的注解和标记中,以及一般如何做到这一点。任何建议将不胜感激!!

tyu7yeag

tyu7yeag1#

由于您还没有在示例JSON中包含预期输出CSV的示例,我真的不知道您想要什么。

我已经解释了你的问题和代码,意思是你想要的东西如下:

docs.csv
========
| id                       | body      | permalink            | author  | title          | date                     |
|--------------------------|-----------|----------------------|---------|----------------|--------------------------|
| 50ab0f8bbcf1bfe2536dc3f9 | Amendment | aRjNnLZkJkTyspAIoRGe | machine | Bill of Rights | 2012-11-20T05:05:15.231Z |

comments.csv
============
| id                       | body                | email                 | author           |
|--------------------------|---------------------|-----------------------|------------------|
| 50ab0f8bbcf1bfe2536dc3f9 | Lorem               | HvizfYVx@pKvLaagH.com | Santiago Dollins |
| 50ab0f8bbcf1bfe2536dc3f9 | Lorem               | WpOUCpdD@hccdxJvT.com | Jaclyn Morado    |
| 50ab0f8bbcf1bfe2536dc3f9 | Loremid est laborum | OgDzHfFN@cWsDtCtx.com | Houston Valenti  |

tags.csv
========
| id                       | tag        |
|--------------------------|------------|
| 50ab0f8bbcf1bfe2536dc3f9 | watchmaker |
| 50ab0f8bbcf1bfe2536dc3f9 | santa      |
| 50ab0f8bbcf1bfe2536dc3f9 | xylophone  |
| 50ab0f8bbcf1bfe2536dc3f9 | math       |
| 50ab0f8bbcf1bfe2536dc3f9 | handsaw    |
| 50ab0f8bbcf1bfe2536dc3f9 | dream      |
| 50ab0f8bbcf1bfe2536dc3f9 | undershirt |
| 50ab0f8bbcf1bfe2536dc3f9 | dolphin    |
| 50ab0f8bbcf1bfe2536dc3f9 | tanker     |
| 50ab0f8bbcf1bfe2536dc3f9 | action     |

字符串
如果你的想法和你想要的很接近,那就继续读下去。
我相信你只需要迭代注解和标签列表,对于你在那里找到的每一个东西,关联文档的ID并将其保存到它自己的结构中的一个新的“行”,所以我做了这些结构来保存最后的行:

doc_rows = []
comment_rows = []
tag_rows = []


我简化了你的process_json()函数,以定位我在示例JSON中看到的键和对象/dicts:

def process(doc: dict[str, Any]):
    """Get the ID for doc, then associate that ID with doc's comments and tags saving each in their own lists."""
    id_ = doc["_id"]["$oid"]

    for comment_dict in doc["comments"]:
        comment_rows.append({"id": id_, **comment_dict})

    for tag in doc["tags"]:
        tag_rows.append({"id": id_, "tag": tag})

    del doc["_id"]
    del doc["comments"]
    del doc["tags"]

    doc["date"] = doc["date"]["$date"]

    doc_rows.append({"id": id_, **doc})


关于这段代码需要注意的一些事情:

  • 我使用{"id": id_, ...}形式构建一个新的dict,以便id键位于第一位。{"id": id_, **comment_dict}意味着创建一个新的dict,其中的key为“id”,然后继续使用comment_dict中的所有键值对。
  • 将注解和标记列表保存到它们自己的行中后,我将它们从文档中删除,然后再将文档添加到它自己的行列表中。
  • 我还认为将date作为字符串值而不是JSON对象/Python dict可能会更好,所以我将dict替换为最终值。

在示例JSON上运行它会创建final _rows列表:

docs = json.load(open("input.json"))
for doc in docs:
    process(doc)

print(doc_rows)
# [
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "body": "Amendment", "permalink": "aRjNnLZkJkTyspAIoRGe", "author": "machine", "title": "Bill of Rights", "date": "2012-11-20T05:05:15.231Z"}
# ]
print(comment_rows)
# [
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "body": "Lorem ", "email": "HvizfYVx@pKvLaagH.com", "author": "Santiago Dollins"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "body": "Lorem", "email": "WpOUCpdD@hccdxJvT.com", "author": "Jaclyn Morado"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "body": "Loremid est laborum", "email": "OgDzHfFN@cWsDtCtx.com", "author": "Houston Valenti"},
# ]
print(tag_rows)
# [
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "watchmaker"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "santa"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "xylophone"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "math"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "handsaw"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "dream"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "undershirt"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "dolphin"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "tanker"},
#     {"id": "50ab0f8bbcf1bfe2536dc3f9", "tag": "action"},
# ]


从那里,创建CSV只需要创建DictWriter(就像你做的那样),尽管你可以从每个列表中给予每个DictWriter第一行来设置字段名:

with open("output/docs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=doc_rows[0])
    writer.writeheader()
    writer.writerows(doc_rows)

with open("output/comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=comment_rows[0])
    writer.writeheader()
    writer.writerows(comment_rows)

with open("output/tags.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=tag_rows[0])
    writer.writeheader()
    writer.writerows(tag_rows)

相关问题