如何使用Python将复杂的嵌套JSON转换为csv?

ozxc1zmp  于 2023-07-31  发布在  Python
关注(0)|答案(2)|浏览(90)

我想使用Python将下面的嵌套JSON文件转换为CSV文件。

{
    "page": {
        "page": 1,
        "pageSize": 250
    },
    "dataRows": [
        {
            "entityId": 349255,
            "Id": "41432-95P",
            "disabled": false,
            "followed": false,
            "suggestion": false,
            "inactive": false,
            "pinned": false,
            "highlighted": false,
            "columnValues": {
                "lastName": [
                    {
                        "columnValueType": "ENTITY",
                        "accessStatus": "OK",
                        "columnValueType": "ENTITY",
                        "name": "McBrady",
                        "Id": "41432-95P",
                        "unpublished": false
                    }
                ],
                "gender": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Male"
                    }
                ],
                "hqCity": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Seattle"
                    }
                ],
                "prefix": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Dr."
                    }
                ],
                "lastUpdateDate": [
                    {
                        "columnValueType": "DATE",
                        "accessStatus": "OK",
                        "columnValueType": "DATE",
                        "expected": false,
                        "asOfdate": "2023-06-26"
                    }
                ],
                "companyName": [
                    {
                        "columnValueType": "BUSINESS_ENTITY",
                        "accessStatus": "OK",
                        "columnValueType": "BUSINESS_ENTITY",
                        "name": "Global Partnerships",
                        "Id": "56347-39",
                        "unpublished": false,
                        "profileType": "INVESTOR"
                    }
                ],
                "roles": [
                    {
                        "columnValueType": "INT_COLUMN_VALUE",
                        "accessStatus": "OK",
                        "columnValueType": "INT_COLUMN_VALUE",
                        "marked": false,
                        "value": 3
                    }
                ],
                "dailyUpdates": [],
                "assetClass": [],
                "hqCountry": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "United States"
                    }
                ],
                "latestNoteAuthor": [],
                "primaryPosition": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Chair, Enterprise Risk, Compliance and Audit Committee and Member of the Board of Directors"
                    }
                ],
                "boardSeats": [
                    {
                        "columnValueType": "INT_COLUMN_VALUE",
                        "accessStatus": "OK",
                        "columnValueType": "INT_COLUMN_VALUE",
                        "marked": false,
                        "value": 2
                    }
                ],
                "fundRoles": [],
                "institution": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Harvard University"
                    },
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "University of Oxford"
                    }
                ],
                "latestNote": [],
                "Id": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "41432-95P"
                    }
                ],
                "hqRegion": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Americas"
                    }
                ],
                "email": [],
                "dealRoles": [],
                "PrimaryCompanyType": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Not-For-Profit Venture Capital"
                    }
                ],
                "mgtRoles": [],
                "hqStateProvince": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Washington"
                    }
                ],
                "fullName": [
                    {
                        "columnValueType": "ENTITY_WITH_NOTE",
                        "accessStatus": "OK",
                        "columnValueType": "ENTITY_WITH_NOTE",
                        "name": "Matthew McBrady Ph.D",
                        "Id": "41432-95P",
                        "unpublished": false
                    }
                ],
                "hqLocation": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Seattle, WA"
                    }
                ],
                "biography": [
                    {
                        "columnValueType": "DESCRIPTION_WITH_SOURCE",
                        "accessStatus": "OK",
                        "columnValueType": "DESCRIPTION_WITH_SOURCE",
                        "value": "Dr. Matthew McBrady serves as Chair, of the Enterprise Risk, Compliance, and Audit Committee.",
                        "morningstarSource": true
                    }
                ],
                "firstName": [
                    {
                        "columnValueType": "ENTITY",
                        "accessStatus": "OK",
                        "columnValueType": "ENTITY",
                        "name": "Matthew",
                        "Id": "41432-95P",
                        "unpublished": false
                    }
                ],
                "phone": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "+1 (206) 652-8773"
                    }
                ],
                "hqSubRegion": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "North America"
                    }
                ],
                "hqAddressLine2": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "Suite 410"
                    }
                ],
                "hqAddressLine1": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "1201 Western Avenue"
                    }
                ],
                "hqFax": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "+1 (206) 456-7877"
                    }
                ],
                "middleName": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "R."
                    }
                ],
                "companyWebsite": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "www.globalpartnerships.org"
                    }
                ],
                "hqZipCode": [
                    {
                        "columnValueType": "STRING",
                        "accessStatus": "OK",
                        "columnValueType": "STRING",
                        "value": "98101"
                    }
                ],
                "weeklyUpdates": []
            }
        }
    ]
}

字符串
在使用chatgpt迭代后,我得到了以下代码。但是,我无法让它捕获包含entityId的第一个嵌套。而且,在随后的每一个巢中,我都想捕获所有的字段;例如,在“lastName”中,我希望同时捕获“name”和“Id”,类似地,在“companyName”中,我希望“name”、“Id”和“profileType”位于单独的列中。正如我在chatgpt中提到的,我不关心“columnValueType”、“accessStatus”或“unpublished”。
下面是Python代码:

import csv
import json

def extract_field_value(data):
    if isinstance(data, dict):
        if 'value' in data:
            return str(data['value'])
        elif 'columnValueType' in data and data['columnValueType'] == 'ENTITY':
            return str(data['name'])
        else:
            values = []
            for key, value in data.items():
                if key not in ['columnValueType', 'accessStatus', 'unpublished']:
                    values.append(extract_field_value(value))
            return ', '.join(values) if values else ''
    elif isinstance(data, list):
        values = []
        for item in data:
            value = extract_field_value(item)
            if value:
                values.append(value)
        return ', '.join(values) if values else ''
    else:
        return str(data) if data is not None else ''

# Read the JSON data
with open('data.json') as file:
    data = json.load(file)

# Extract the nested data rows
data_rows = data['dataRows']

# Extract the column headers from the first data row
column_headers = list(data_rows[0]['columnValues'].keys())

# Create a CSV file
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the column headers as the first row
    writer.writerow(column_headers)

    # Write each data row as a separate row in the CSV file
    for row in data_rows:
        column_values = row['columnValues']
        csv_row = []
        for column_header in column_headers:
            values = column_values.get(column_header, [])
            value = extract_field_value(values)
            csv_row.append(value)
        writer.writerow(csv_row)

print("CSV file created successfully.")

tnkciper

tnkciper1#

我看到你试图写一个更通用的程序来弄清楚JSON的结构。我认为这会更容易,至少在一开始,因为你知道这个结构,只是让你的代码知道它。非常明确,在这种情况下,它看起来简单得多。
这种方法还利用了csv模块中的DictWriter类,因此最后一行中有标题...不需要单独跟踪。我也喜欢打字,所以我为我看到的JSON结构添加了类型提示(特别是columnValues)。

import csv
import json

from typing import Any

def extract_columns(col_values: dict[str, list[dict[str, Any]]]) -> dict[str, Any]:
    """
    Flatten the dict of columnValues down to a single dict.  Each columnValue key becomes the
    prefix for the value keys that follow.  Assumes that each column value is a single-item item list.
    """
    row: dict[str, Any] = {}

    for col_name, list_of_vals in col_values.items():
        if list_of_vals == []:
            continue

        col_name = col_name.upper()

        vals = list_of_vals[0]
        for k, v in vals.items():
            if k in ["columnValueType", "accessStatus", "unpublished"]:
                continue

            row[col_name + "_" + k] = v

    return row

def extract_row(data_row: dict[str, Any]) -> dict[str, Any]:
    row: dict[str, Any] = {}

    for k in ["Id", "disabled", "followed", "suggestion", "inactive", "pinned", "highlighted"]:
        row[k] = data_row[k]

    row.update(extract_columns(data_row["columnValues"]))

    return row

csv_rows: list[dict[str, Any]] = []
with open("input.json") as f:
    data = json.load(f)

    for data_row in data["dataRows"]:
        row = extract_row(data_row)
        csv_rows.append(row)

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=csv_rows[0])
    writer.writeheader()
    writer.writerows(csv_rows)

字符串
当我在你的示例JSON上运行它时,第一行最终看起来像这样:
| 最终值| Final value |
| --| ------------ |
| 41432-95P| 41432-95P |
| 假的| False |
| 假的| False |
| 假的| False |
| 假的| False |
| 假的| False |
| 假的| False |
| 麦克布雷迪| McBrady |
| 41432-95P| 41432-95P |
| 男性| Male |
| 西雅图| Seattle |
| 博士| Dr. |
| 假的| False |
| 2023-06-26 2023-06-26| 2023-06-26 |
| 全球伙伴关系| Global Partnerships |
| 56347-39| 56347-39 |
| 投资者关系| INVESTOR |
| 假的| False |
| 三个| 3 |
| 美利坚合众国Name| United States |
| 企业风险、合规和审计委员会主席兼董事会成员| Chair, Enterprise Risk, Compliance and Audit Committee and Member of the Board of Directors |
| 假的| False |
| 二个| 2 |
| 哈佛大学| Harvard University |
| 41432-95P| 41432-95P |
| 美洲国家| Americas |
| 非营利风险投资| Not-For-Profit Venture Capital |
| 华盛顿| Washington |
| Matthew McBrady博士| Matthew McBrady Ph.D |
| 41432-95P| 41432-95P |
| 西雅图,WA| Seattle, WA |
| Matthew McBrady博士担任企业风险、合规和审计委员会主席。||
| 真的| True |
| 马修| Matthew |
| 41432-95P| 41432-95P |
| 电话:+1(206)652-8773传真:+1(206)652-8773| +1 (206) 652-8773 |
| 北美洲| North America |
| 410套房| Suite 410 |
| 西大街1201号| 1201 Western Avenue |
| 电话:+1(206)456-7877传真:+1(206)456-7877| +1 (206) 456-7877 |
| n. R| R. |
| |<www.globalpartnerships.org>|
| 九八一零一| 98101 |
我知道这与您的代码所显示的内容有所不同,但我只想给予您一个粗略的概念(方法)。我相信你可以调整代码,特别是extract_columns,以满足你的需要。

nzrxty8p

nzrxty8p2#

您可以尝试以下示例如何将Json解析为DataFrame:

import json
import pandas as pd

with open('your_file.json', 'r') as f_in:
    data = json.load(f_in)

out = []
for r in data['dataRows']:
    d = {}
    for k, v in r['columnValues'].items():
        if not v:
            d[k] = None
        else:
            del v[0]['accessStatus']
            del v[0]['columnValueType']
            for kk, vv in v[0].items():
                d[f'{k}.{kk}'] = vv
    out.append(d)

df = pd.DataFrame(out)
print(df)

字符串
图纸:

lastName.name lastName.Id  lastName.unpublished gender.value hqCity.value prefix.value  lastUpdateDate.expected lastUpdateDate.asOfdate     companyName.name companyName.Id  companyName.unpublished companyName.profileType  roles.marked  roles.value dailyUpdates assetClass hqCountry.value latestNoteAuthor                                                                        primaryPosition.value  boardSeats.marked  boardSeats.value fundRoles   institution.value latestNote   Id.value hqRegion.value email dealRoles        PrimaryCompanyType.value mgtRoles hqStateProvince.value         fullName.name fullName.Id  fullName.unpublished hqLocation.value                                                                                biography.value  biography.morningstarSource firstName.name firstName.Id  firstName.unpublished        phone.value hqSubRegion.value hqAddressLine2.value hqAddressLine1.value        hqFax.value middleName.value        companyWebsite.value hqZipCode.value weeklyUpdates
0       McBrady   41432-95P                 False         Male      Seattle          Dr.                    False              2023-06-26  Global Partnerships       56347-39                    False                INVESTOR         False            3         None       None   United States             None  Chair, Enterprise Risk, Compliance and Audit Committee and Member of the Board of Directors              False                 2      None  Harvard University       None  41432-95P       Americas  None      None  Not-For-Profit Venture Capital     None            Washington  Matthew McBrady Ph.D   41432-95P                 False      Seattle, WA  Dr. Matthew McBrady serves as Chair, of the Enterprise Risk, Compliance, and Audit Committee.                         True        Matthew    41432-95P                  False  +1 (206) 652-8773     North America            Suite 410  1201 Western Avenue  +1 (206) 456-7877               R.  www.globalpartnerships.org           98101          None

相关问题