在Python中将嵌套JSON转换为CSV文件

qc6wkl3g  于 2022-12-06  发布在  Python
关注(0)|答案(5)|浏览(143)

我知道这个问题已经被问过很多次了。我试过几种解决方案,但我都不能解决我的问题。
我有一个大的嵌套JSON文件(1.4GB),我想让它平面,然后将其转换为CSV文件。
JSON结构如下所示:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

我知道这很容易用pandas模块完成,但我对它不熟悉。

已编辑

所需的输出应该如下所示:

company_number, address_line_1, locality, country_of_residence, kind,

12345678, Address 1, Henley-On-Thamed, England, individual-person-with-significant-control

注意,这只是简短的版本,输出应该包含所有字段。

kh212irz

kh212irz1#

请向下滚动查看更新、更快的解决方案

这是一个比较老的问题,但我整晚都在为类似的情况而挣扎,想得到一个令人满意的结果,我得出了这个结论:

import json
import pandas

def cross_join(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop('key', 1)

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=None):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)

    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')

说明:

cross_join**函数是我发现的一种执行笛卡尔积的简洁方法。(第10页)

json_to_dataframe函数使用panda Dataframe 来完成逻辑。在我的例子中,json是深度嵌套的,我想把字典key:value对拆分成列,但是我想把列表转换成列的行--因此有了concat --然后我把它与上层交叉连接,这样就增加了记录数,这样列表中的每个值都有自己的行,而前面的列是相同的。

递归创建的堆栈与下面的堆栈交叉联接,直到返回最后一个堆栈。
然后,使用表格格式的 Dataframe ,可以使用**”df.to_csv()"** Dataframe 对象方法轻松转换为CSV。
这应该适用于深度嵌套的JSON,能够通过上面描述的逻辑将其规范化为行。
我希望有一天,这能帮助到一些人。只是想给予这个令人敬畏的社区。

-———————————————————————————————————————————————————————————————————————————————————————————-
稍后编辑:全新解决方案

我回到这个主题,因为虽然dataframe选项可以工作,但它花了应用几分钟来解析不太大的JSON数据。因此,我想到做dataframe所做的,但由我自己:

from copy import deepcopy
import pandas

def cross_join(left, right):
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows

def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem

def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for item in data:
                [rows.append(elem) for elem in flatten_list(flatten_json(item, prev_heading))]
        else:
            rows = [{prev_heading[1:]: data}]
        return rows

    return pandas.DataFrame(flatten_json(data_in))

if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        {"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"},
                        {"id": "1003", "type": "Blueberry"},
                        {"id": "1004", "type": "Devil's Food"}
                    ]
            },
        "topping":
            [
                {"id": "5001", "type": "None"},
                {"id": "5002", "type": "Glazed"},
                {"id": "5005", "type": "Sugar"},
                {"id": "5007", "type": "Powdered Sugar"},
                {"id": "5006", "type": "Chocolate with Sprinkles"},
                {"id": "5003", "type": "Chocolate"},
                {"id": "5004", "type": "Maple"}
            ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)

输出:

id   type  name   ppu batters.batter.id batters.batter.type topping.id              topping.type
0   0001  donut  Cake  0.55              1001             Regular       5001                      None
1   0001  donut  Cake  0.55              1001             Regular       5002                    Glazed
2   0001  donut  Cake  0.55              1001             Regular       5005                     Sugar
3   0001  donut  Cake  0.55              1001             Regular       5007            Powdered Sugar
4   0001  donut  Cake  0.55              1001             Regular       5006  Chocolate with Sprinkles
5   0001  donut  Cake  0.55              1001             Regular       5003                 Chocolate
6   0001  donut  Cake  0.55              1001             Regular       5004                     Maple
7   0001  donut  Cake  0.55              1002           Chocolate       5001                      None
8   0001  donut  Cake  0.55              1002           Chocolate       5002                    Glazed
9   0001  donut  Cake  0.55              1002           Chocolate       5005                     Sugar
10  0001  donut  Cake  0.55              1002           Chocolate       5007            Powdered Sugar
11  0001  donut  Cake  0.55              1002           Chocolate       5006  Chocolate with Sprinkles
12  0001  donut  Cake  0.55              1002           Chocolate       5003                 Chocolate
13  0001  donut  Cake  0.55              1002           Chocolate       5004                     Maple
14  0001  donut  Cake  0.55              1003           Blueberry       5001                      None
15  0001  donut  Cake  0.55              1003           Blueberry       5002                    Glazed
16  0001  donut  Cake  0.55              1003           Blueberry       5005                     Sugar
17  0001  donut  Cake  0.55              1003           Blueberry       5007            Powdered Sugar
18  0001  donut  Cake  0.55              1003           Blueberry       5006  Chocolate with Sprinkles
19  0001  donut  Cake  0.55              1003           Blueberry       5003                 Chocolate
20  0001  donut  Cake  0.55              1003           Blueberry       5004                     Maple
21  0001  donut  Cake  0.55              1004        Devil's Food       5001                      None
22  0001  donut  Cake  0.55              1004        Devil's Food       5002                    Glazed
23  0001  donut  Cake  0.55              1004        Devil's Food       5005                     Sugar
24  0001  donut  Cake  0.55              1004        Devil's Food       5007            Powdered Sugar
25  0001  donut  Cake  0.55              1004        Devil's Food       5006  Chocolate with Sprinkles
26  0001  donut  Cake  0.55              1004        Devil's Food       5003                 Chocolate
27  0001  donut  Cake  0.55              1004        Devil's Food       5004                     Maple

正如上面所做的那样,cross_join函数做的事情与 Dataframe 解决方案中的几乎相同,但没有 Dataframe ,因此速度更快。
我添加了flatten_list生成器,因为我想确保JSON数组都很好,并且是扁平的,然后作为一个字典列表提供,该列表由一次迭代中的前一个键组成,然后分配给列表的每个值。在本例中,这非常类似于pandas.concat的行为。
主函数json_to_dataframe中的逻辑和以前一样,需要改变的只是将 Dataframe 执行的操作作为编码函数。
此外,在 Dataframe 解决方案中,我没有将前面的标题附加到嵌套对象,但除非您100%确定列名中没有冲突,否则这是非常强制性的。
我希望这能有所帮助:)。

EDIT:修改了cross_join函数来处理嵌套列表为空的情况,基本上保持了之前的结果集不变。即使在示例JSON数据中添加了空的JSON列表,输出也没有变化。谢谢@Nazmus Sakib 指出。

jucafojl

jucafojl2#

对于给定的JSON数据,可以通过解析JSON结构来返回所有叶节点的列表。
这假设你的结构在整个过程中是一致的,如果每个条目可以有不同的字段,请参见第二种方法。
例如:

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves
    else:
        return [(key, item)]

with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    write_header = True

    for entry in json.load(f_input):
        leaf_entries = sorted(get_leaves(entry))

        if write_header:
            csv_output.writerow([k for k, v in leaf_entries])
            write_header = False

        csv_output.writerow([v for k, v in leaf_entries])

如果您的JSON数据是您给定格式的条目列表,那么您应该得到如下输出:

address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977

如果每个条目都包含不同的(或可能丢失的)字段,则更好的方法是使用DictWriter。在这种情况下,需要处理所有条目以确定可能的fieldnames的完整列表,从而可以写入正确的头。

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        return {key : item}

with open('json.txt') as f_input:
    json_data = json.load(f_input)

# First parse all entries to get the complete fieldname list
fieldnames = set()

for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)
wnavrhmk

wnavrhmk3#

你可以使用panda库的json_normalize函数来扁平化这个结构体,然后按照你的意愿来处理它。例如:

import pandas as pd
import json

raw = """[{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}]"""

data = json.loads(raw)
data = pd.json_normalize(data)
print(data.to_csv())

它为您提供:

,company_number,data.address.address_line_1,data.address.locality,data.address.postal_code,data.address.premises,data.address.region,data.country_of_residence,data.date_of_birth.month,data.date_of_birth.year,data.etag,data.kind,data.links.self,data.name,data.name_elements.forename,data.name_elements.middle_name,data.name_elements.surname,data.name_elements.title,data.nationality,data.natures_of_control,data.notified_on
0,12345678,Address 1,Henley-On-Thames,RG9 1DP,161,Oxfordshire,England,2,1977,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,individual-person-with-significant-control,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,John M Smith,John,M,Smith,Mrs,Vietnamese,['ownership-of-shares-50-to-75-percent'],2016-04-06
knpiaxh1

knpiaxh14#

提到Bogdan Mircea的回答,
这段代码几乎达到了我的目的!但是当它在嵌套的json中遇到一个空列表时,它返回一个空的 Dataframe 。
您可以通过在代码中添加以下内容来轻松解决这个问题

elif isinstance(data, list):
        rows = []
        if(len(data) != 0):
            for i in range(len(data)):
                [rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))]
        else:
            data.append(None)
            [rows.append(elem) for elem in flatten_list(flatten_json(data[0], prev_heading))]
c86crjj0

c86crjj05#

import pandas as pd
import json
import glob
from pandas.io.json import json_normalize

json_files = glob.glob("*.json")
dfs = []
for file in json_files:
    with open(file) as f:
        for line in f.readlines():
            df = pd.json_normalize(json.loads(line))
            list_= ['Item.dataId.S','Item.metadata.M.timestamp.S','Item.sensor.M.celcius.N','Item.sensor.M.water.N'']
            df = df.loc[:, df.columns.isin(list_)]
            dfs.append(df)
df_combine = pd.concat(dfs, sort=False)
df_combine.to_csv('json_to_raw.csv',index= None)

相关问题