JSON: converting a nested dict of any depth to a pandas DataFrame

emeijp43  posted on 2022-11-19  in Other

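I have been struggling to convert a nested dictionary of depth D into a pandas DataFrame.
I have tried a recursive function like the one below, but my problem is that when I iterate over a key, I do not know what the previous key was.
I have also tried json_normalize and building the DataFrame from the dict with pandas, but I always end up with dots in the column names...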
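To make the dotted-column problem concrete, here is a minimal sketch using a small hypothetical dict (not the real data). It shows that `pd.json_normalize` joins the nested keys into a single column name instead of turning the intermediate keys into cell values:

```python
import pandas as pd

# Hypothetical toy input, much smaller than the real dictionary
nested = {"gender": {"male": {"country": "FRENCH"}}}

flat = pd.json_normalize(nested)
print(list(flat.columns))  # ['gender.male.country']

# The separator can be changed, but the keys are still fused into one name
flat_underscore = pd.json_normalize(nested, sep="_")
print(list(flat_underscore.columns))  # ['gender_male_country']
```

Either way the result is a single wide row, which is why a custom traversal is needed here.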
Sample code of my recursive attempt:

def iterate_dict(d, i=2, cols=None):
    if cols is None:
        cols = []  # avoid sharing one default list between calls

    for k, v in d.items():
        # missing here: how to check for the previous key
        # so that I can build a structure to create the dataframe.
        if isinstance(v, dict):
            print('this is k: ', k)
            if i % 2 == 0:
                cols.append(k)
            i += 1
            iterate_dict(v, i, cols)
        else:
            print('this is k2: ', k, ': ', v)


iterate_dict(test)

Here is an example of my dictionary:

# example 2 
test = {
    'column-gender': {
        'male': {
            'column-country' : {
                'FRENCH': {
                    'column-class': [0,1]
                },
                ('SPAIN','ITALY') : {
                    'column-married' : {
                        'YES': {
                            'column-class' : [0,1]
                        },
                        'NO' : {
                            'column-class' : 2
                        }
                    }
                }
            }
        },
        'female': {
            'column-country' : {
                ('FRENCH', 'SPAIN') : {
                    'column-class' : [[1,2],'#']
                },
                'REST-OF-VALUES': {
                    'column-married' : '*'
                }
            }
        }
    }
}

I would like the DataFrame to look like this:

Any suggestions are welcome :)

q8l4jmvw  #1

If the column keys are always prefixed with column-, you can write a recursive function:

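import pandas as pd

def data_to_df(data):
    rec_out = []  # one dict per output row

    def dict_to_rec(d, curr_row=None):
        curr_row = curr_row or {}  # row built so far on the way down
        for k, v in d.items():
            if isinstance(k, str) and k.startswith('column-'):  # definition of a column
                col = k[len('column-'):]  # column name without the prefix
                if isinstance(v, dict):
                    # each key of v is a value of this column; recurse into its sub-dict
                    for val, nested_dict in v.items():
                        dict_to_rec(nested_dict, {**curr_row, col: val})
                else:
                    # leaf value: the row is complete
                    rec_out.append({**curr_row, col: v})

    dict_to_rec(data)
    return pd.DataFrame(rec_out)

print(data_to_df(test))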

Edit: removed unnecessary variables and parameters.
Output:

   gender          country        class married
0    male           FRENCH       [0, 1]     NaN
1    male   (SPAIN, ITALY)       [0, 1]     YES
2    male   (SPAIN, ITALY)            2      NO
3  female  (FRENCH, SPAIN)  [[1, 2], #]     NaN
4  female   REST-OF-VALUES          NaN       *
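
Since the question asks about arbitrary depth, Python's recursion limit could become a concern for very deep inputs. As a rough sketch (not part of the original answer), the same traversal can be written iteratively with an explicit stack. The name `data_to_df_iter` and the `prefix` parameter are hypothetical, and the sketch assumes the same structure as above: column keys are strings starting with `column-`, and every value under a column key is either a leaf or a dict of sub-dicts.

```python
import pandas as pd

def data_to_df_iter(data, prefix="column-"):
    rows = []
    stack = [(data, {})]  # (sub-dict still to process, row built so far)
    while stack:
        d, row = stack.pop()
        pending = []
        for key, value in d.items():
            if not (isinstance(key, str) and key.startswith(prefix)):
                continue  # skip keys that do not name a column
            col = key[len(prefix):]
            if isinstance(value, dict):
                # each key of `value` is a cell value for `col`
                for val, nested in value.items():
                    pending.append((nested, {**row, col: val}))
            else:
                rows.append({**row, col: value})  # leaf: row is complete
        # push in reverse so sub-dicts are visited in their original order
        stack.extend(reversed(pending))
    return pd.DataFrame(rows)

# Small hypothetical input to try it out
sample = {
    "column-g": {
        "m": {"column-c": {"FR": {"column-k": 1}, "ES": {"column-k": 2}}},
        "f": {"column-k": 3},
    }
}
df_iter = data_to_df_iter(sample)
print(df_iter)
```

One difference to keep in mind: if a dict mixes leaf values and nested dicts at the same level, the leaves are emitted first here, so the row order may differ from the recursive version.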

h79rfbju  #2

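I am not sure how consistent the data is, but for the sake of understanding we can do something like the following. Keep in mind that this is only a small demo of one way to handle the data; you can spend more time polishing it as needed.
For better understanding, I have added comments on each step.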

import pandas as pd

def nested_dict_to_df(data, columns=None):

    if columns is None:
        columns = []

    # if the data is a dictionary, then we need to iterate over the keys
    if isinstance(data, dict):

        for key, value in data.items():
            columns.append(key)
            yield from nested_dict_to_df(value, columns)  # recursive call
            columns.pop()  # remove the last element
    else:
        # leaf reached: emit the full key path plus the leaf value as one row
        yield columns + [data]

df = pd.DataFrame(nested_dict_to_df(test))

# Drop columns [0, 2, 4, 6] (the 'column-...' keys), which are not needed in the final output
df = df.drop(df.columns[[0, 2, 4, 6]], axis=1)

header = ["GENDER", "COUNTRY", "CLASS", "MARRIED"]  # Desired header
df.columns = header

print(df)

Output:

   GENDER          COUNTRY        CLASS MARRIED
0    male           FRENCH       [0, 1]    None
1    male   (SPAIN, ITALY)          YES  [0, 1]
2    male   (SPAIN, ITALY)           NO       2
3  female  (FRENCH, SPAIN)  [[1, 2], #]    None
4  female   REST-OF-VALUES            *    None
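
Dropping columns by position places each value according to its depth rather than its column- key, which is why YES and NO appear under CLASS in the output above. A possible refinement, sketched here under the assumption that every path strictly alternates between a column- key and a value, is to pair each column key with the value that follows it. The names `iter_paths` and `paths_to_df` are hypothetical.

```python
import pandas as pd

def iter_paths(data, path=()):
    """Yield one tuple per leaf: all keys on the way down, then the leaf value."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_paths(value, path + (key,))
    else:
        yield path + (data,)

def paths_to_df(data, prefix="column-"):
    rows = []
    for path in iter_paths(data):
        # Assumed layout: column key, value, column key, value, ...
        names, values = path[0::2], path[1::2]
        rows.append({name[len(prefix):]: value
                     for name, value in zip(names, values)})
    return pd.DataFrame(rows)

# Small hypothetical input where the same column appears at different depths
sample = {
    "column-a": {
        "x": {"column-b": "YES"},
        "y": {"column-c": {"p": {"column-b": [0, 1]}}},
    }
}
df_pairs = paths_to_df(sample)
print(df_pairs)
```

Because rows are built as dicts, pandas aligns values by column name, so no positional drop or manual header is needed.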
