Pandas|JSON规范化|深度嵌套节

5kgi1eie  于 2023-02-26  发布在  其他
关注(0)|答案(2)|浏览(200)

我有一个非常复杂/嵌套的JSON数据结构。
在阅读了SO上的其他解决方案后,我对我可能需要制作多个DF并将它们合并在一起(例如json_normalize with multiple record paths)这一事实感到放心。

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
 "sectionC": {....},
"sectionD": {....}, 
"sectonE":{...}
}]}

我可以通过以下操作创建合并的基础:

with open('my_file.json') as simulation_file:    
    simulation_data = json.load(simulation_file) 
df_base = pd.json_normalize(simulation_data, 'activities',['topLevelIdentifier'])

那么归一化下一部分也不算太糟:

df_taskList = pd.json_normalize(simulation_data['activities'],
                               'taskList',['activityId'])

它正在规范化下一个部分,这让我卡住了。下面是我正在尝试的,但抛出了一个错误:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                
                                'activityId',['taskList',['taskIdentifier',['taskTime',['unit', 'value']], 'taskLocationIdentifier']])

我基本上是尝试将sectionB扁平化,使其具有activityId,以便可以将所有部分合并在一起。
如果有一种更简单的方法可以将像这样复杂的格式展平到最深的层,我会考虑其他的方法!

wz8daaqr

wz8daaqr1#

你可以试试flatten_json。我不得不从你的例子中删除C,D,E部分,所以不清楚这是否能如你所需的那样工作。参考文档以获得更多信息。
将字典设置为test_json

from flatten_json import flatten

dic_flattened = (flatten(d, '.') for d in [test_json]) 'note test_json needs to be in square brackets or you'll get an error
df = pd.DataFrame(dic_flattened)
df

  topLevelIdentifier activities.0.activityId  ... activities.0.sectionB.someCapacityB.totalWeight.value  activities.0.sectionB.someCapacityB.totalWeight.unit
0             a86507                1f13d848  ...                                               0.00000                                                 GRAMS

[1 rows x 22 columns]

' or transpose it to see the columns better
df.T

                                                               0
topLevelIdentifier                                        a86507
activities.0.activityId                                 1f13d848
activities.0.taskList.0.taskIdentifier                 c6ba-1012
activities.0.taskList.0.taskTime.value                         0
activities.0.taskList.0.taskTime.unit                    SECONDS
activities.0.taskList.0.taskLocationIdentifier              3438
activities.0.taskList.1.taskIdentifier                 cc48-07bf
activities.0.taskList.1.taskTime.value                         0
activities.0.taskList.1.taskTime.unit                    SECONDS
activities.0.taskList.1.taskLocationIdentifier              4b32
activities.0.taskList.2.taskIdentifier                    35b896
activities.0.taskList.2.taskTime.value                         0
activities.0.taskList.2.taskTime.unit                    SECONDS
activities.0.taskList.2.taskLocationIdentifier             7bcca
activities.0.sectionB.someCapacityA.totalVolume.value       8415
activities.0.sectionB.someCapacityA.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityA.totalWeight.value 1059.00000
activities.0.sectionB.someCapacityA.totalWeight.unit       GRAMS
activities.0.sectionB.someCapacityB.totalVolume.value    0.00000
activities.0.sectionB.someCapacityB.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityB.totalWeight.value    0.00000
activities.0.sectionB.someCapacityB.totalWeight.unit       GRAMS
bnl4lu3b

bnl4lu3b2#

您可以将.explode("taskList")添加到第一次规格化的结果中。
然后,可以规范化taskList列,复制基索引,并将结果连接起来。

df_base = pd.json_normalize(
   simulation_data, 
   record_path="activities", 
   meta="topLevelIdentifier"
).explode("taskList")

df = pd.concat(
   [ 
      df_base.drop(columns="taskList"),
      pd.json_normalize(df_base["taskList"]).set_index(df_base.index)
   ],
   axis=1
)
>>> df.T
                                                   0           0           0
activityId                                  1f13d848    1f13d848    1f13d848
sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionB.someCapacityA.totalVolume.value        8415        8415        8415
sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
topLevelIdentifier                            a86507      a86507      a86507
taskIdentifier                             c6ba-1012   cc48-07bf      35b896
taskLocationIdentifier                          3438        4b32       7bcca
taskTime.value                                     0           0           0
taskTime.unit                                SECONDS     SECONDS     SECONDS

相关问题