pandas 如何合并在一列中的多个字典与相同的关键字为不同的行在嵌套框架

i34xakig  于 2023-11-15  发布在  其他
关注(0)|答案(3)|浏览(94)

下面是一个例子:

ID    area   data_info
a001   NY    [{color: 'Yellow', 'count': 3, 'weight': 5}, {color: 'Blue', 'count': 2, 'weight': 11} , {color: 'Red', 'count': 7, 'weight': 3}] 
a002   SF    [{color: 'Green', 'count': 1, 'weight': 14},{color: 'Yellow', 'count': 9, 'weight': 2}]
a003   NY    [{color: 'Blue', 'count': 5, 'weight': 6}, {color: 'Black', 'count': 2, 'weight': 15}]

字符串
具有相同面积的行将合并组合data_info中的信息,并基于相同的color键值聚合计数和权重。
预期结果:
ID a001和a003都具有相同的区域NY,因此data_info的值将合并并聚合到一行中。由于a001和a003在字典中的颜色为:Blue,因此它将聚合count的总和和weight的总和。

area   data_info
  NY    [{color: 'Yellow', 'count': 3, 'weight': 5}, {color: '**Blue**', 'count': **7**, 'weight': **17**} , {color: 'Red', 'count': 7, 'weight': 3}, {color: 'Black', 'count': 2, 'weight': 15}] 
  SF    [{color: 'Green', 'count': 1, 'weight': 14},{color: 'Yellow', 'count': 9, 'weight': 2}]

wwtsj6pe

wwtsj6pe1#

  • 假设您在data_info(字典列表)中具有有效的数据结构。*

您可以使用pd.json_normalizedata_info列中提取数据:

# Create an index to allow merging the area column
idx = df.index.repeat(df['data_info'].str.len())

# Extract data and set the area column
dat = (pd.json_normalize(df['data_info'].explode())
         .assign(area=df.loc[idx, 'area'].values))

# Compute the sum
dat = dat.groupby(['color', 'area']).sum().reset_index('color')

# Create the expected output
out = (pd.DataFrame({'area': dat.index, 'data_info': dat.to_dict('records')})
         .groupby('area', as_index=False).agg(list))

字符串
输出量:

>>> out
  area  data_info
0   NY  [{'color': 'Black', 'count': 2, 'weight': 15}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Yellow', 'count': 3, 'weight': 5}]
1   SF  [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]


中间步骤

>>> (pd.json_normalize(df['data_info'].explode())
       .assign(area=df.loc[idx, 'area'].values))

    color  count  weight area
0  Yellow      3       5   NY
1    Blue      2      11   NY
2     Red      7       3   NY
3   Green      1      14   SF
4  Yellow      9       2   SF
5    Blue      5       6   NY
6   Black      2      15   NY

>>> dat.groupby(['color', 'area']).sum().reset_index('color')
       color  count  weight
area                       
NY     Black      2      15
NY      Blue      7      17
SF     Green      1      14
NY       Red      7       3
NY    Yellow      3       5
SF    Yellow      9       2

gkl3eglg

gkl3eglg2#

经典的for循环方式。它在结尾创建一个文件(result.txt)以更好地查看结果。

验证码:

import pandas as pd
data = {
    'ID': ['a001', 'a002', 'a003'],
    'area': ['NY', 'SF', 'NY'],
    'data_info': [
        [{'color': 'Yellow', 'count': 3, 'weight': 5}, {'color': 'Blue', 'count': 2, 'weight': 11}, {'color': 'Red', 'count': 7, 'weight': 3}],
        [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}],
        [{'color': 'Blue', 'count': 5, 'weight': 6}, {'color': 'Black', 'count': 2, 'weight': 15}]
    ]
}
df = pd.DataFrame(data)

def combine_data_info(group):
    combined_data = {}
    for _, row in group.iterrows():
        for entry in row['data_info']:
            color = entry['color']
            if color not in combined_data:
                combined_data[color] = {'color': color, 'count': 0, 'weight': 0}
            combined_data[color]['count'] += entry['count']
            combined_data[color]['weight'] += entry['weight']
    return list(combined_data.values())

result = df.groupby('area').apply(combine_data_info).reset_index(name='data_info')

result.to_csv('result.txt', sep='\t', index=False)

字符串

输出:

area    data_info
NY  [{'color': 'Yellow', 'count': 3, 'weight': 5}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Black', 'count': 22, 'weight': 165}]
SF  [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]

nzrxty8p

nzrxty8p3#

使用自定义函数和groupby.agg

from itertools import chain
def agg_dict(s):
    return (pd.DataFrame(chain.from_iterable(s))
              .groupby('color', as_index=False).sum()
              .to_dict('records')
           )

out = df.groupby('area', as_index=False)['data_info'].agg(agg_dict)

字符串
输出量:

area                                                                                                                                                                             data_info
0   NY  [{'color': 'Black', 'count': 2, 'weight': 15}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Yellow', 'count': 3, 'weight': 5}]
1   SF                                                                                          [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]

相关问题