Cumulative sum with missing categories in pandas

Posted by cvxl0en2 on 2023-04-04 in Other


Suppose I have the following dataset:

import pandas as pd

df_dict = ({'unit' : [1, 1, 1, 2, 2, 2], 'cat' : [1, 2, 3, 1, 2, 4], 
           'count' : [8, 3, 2, 2, 8, 7] })
df = pd.DataFrame(df_dict)

df.set_index('unit', inplace = True)

It looks like this:

      cat  count
unit            
1       1      8
1       2      3
1       3      2
2       1      2
2       2      8
2       4      7

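The count column gives the frequency of each category observed within a unit. What I want is the cumulative frequency over all four categories for each unit. Note that category 4 is missing from unit 1 and category 3 is missing from unit 2.
So the final result would be,
for unit 1: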

[8/13, 11/13, 13/13, 13/13]

and for unit 2:

[2/17, 10/17, 10/17, 17/17]

I know how to get the cumulative sum with groupby and cumsum, but then unit 1, for example, has no value for the missing category 4.
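To make the gap concrete, here is a minimal sketch of the plain groupby/cumsum attempt described above. The variable names `running` and `naive` are illustrative only; the point is that the result has one row per existing (unit, cat) pair, so a category absent from a unit simply never appears.

```python
import pandas as pd

df = pd.DataFrame({'unit': [1, 1, 1, 2, 2, 2],
                   'cat': [1, 2, 3, 1, 2, 4],
                   'count': [8, 3, 2, 2, 8, 7]}).set_index('unit')

# Running total of 'count' within each unit (grouping on the index level).
running = df.groupby(level=0)['count'].cumsum()

# Attach the running total next to the original rows (hypothetical column name 'cum').
naive = df.assign(cum=running.to_numpy())
print(naive)
```

Unit 1 still has only three rows here, so there is nothing to read off for category 4.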
Thank you for your time!

pandas

Source: https://stackoverflow.com/questions/19235264/cumulative-sum-with-missing-categories-in-pandas

1 Answer


s4n0splo1#

import pandas as pd

df_dict = ({'unit' : [1, 1, 1, 2, 2, 2], 'cat' : [1, 2, 3, 1, 2, 4], 
           'count' : [8, 3, 2, 2, 8, 7] })
df = pd.DataFrame(df_dict)

df.set_index('unit', inplace = True)    

cumsum_count = df.groupby(level=0).apply(lambda x: pd.Series(x['count'].cumsum().values, index=x['cat']))
# unit  cat
# 1     1       8
#       2      11
#       3      13
# 2     1       2
#       2      10
#       4      17
# dtype: int64

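# Forward-fill along each row so a missing category inherits the previous
# cumulative count (fillna(method='ffill') is deprecated in recent pandas).
cumsum_count = cumsum_count.unstack(level=1).ffill(axis=1)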
# cat   1   2   3   4
# unit               
# 1     8  11  13  13
# 2     2  10  10  17

totals = df.groupby(level=0)['count'].sum()
# unit
# 1       13
# 2       17
# Name: count, dtype: int64

cumsum_dist = cumsum_count.div(totals, axis=0)
print(cumsum_dist)

yields

cat          1         2         3  4
unit                                 
1     0.615385  0.846154  1.000000  1
2     0.117647  0.588235  0.588235  1
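As a cross-check on these numbers, the same table can be reached by a different route. This is an alternative sketch, not the approach used in the code above: pivot to one column per category, treat a missing category as a count of 0, then take the cumulative sum across the columns. It assumes each (unit, cat) pair occurs at most once, since `pivot` raises on duplicates; the names `wide`, `cum` and `dist` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'unit': [1, 1, 1, 2, 2, 2],
                   'cat': [1, 2, 3, 1, 2, 4],
                   'count': [8, 3, 2, 2, 8, 7]})

# One row per unit, one column per category; absent categories become 0.
wide = df.pivot(index='unit', columns='cat', values='count').fillna(0)

# Cumulative counts across categories, then divide each row by its total.
cum = wide.cumsum(axis=1)
dist = cum.div(wide.sum(axis=1), axis=0)
print(dist)
```

Filling with 0 before the cumulative sum has the same effect as forward-filling afterwards: a missing category adds nothing, so the running total carries over unchanged.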

I really don't know how to explain my solution well, probably because I arrived at it somewhat by accident.

s.apply(lambda x: pd.Series(1, index=x))

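associates values with an index. Once you have associated the cumulative counts (values), e.g. [8, 11, 13], with the cat numbers (index), e.g. [1, 2, 3], you are basically home free. The rest is just standard applications of unstack, fillna (forward fill), div and groupby.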
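The idea of tying values to an index can be sketched in isolation. The snippet below builds the two per-unit Series by hand (an illustrative shortcut, standing in for what the groupby/apply step produces) and shows that unstacking aligns them on the category labels, leaving NaN exactly where a category is missing. Forward fill then closes those gaps.

```python
import pandas as pd

# Cumulative counts indexed by category, one Series per unit (built by hand here).
unit1 = pd.Series([8, 11, 13], index=pd.Index([1, 2, 3], name='cat'))
unit2 = pd.Series([2, 10, 17], index=pd.Index([1, 2, 4], name='cat'))

# Stack them under a 'unit' level, then spread categories into columns.
stacked = pd.concat({1: unit1, 2: unit2}, names=['unit'])
wide = stacked.unstack(level=1)   # NaN at (unit 1, cat 4) and (unit 2, cat 3)

# Each gap takes the cumulative count of the category to its left.
filled = wide.ffill(axis=1)
print(filled)
```

Because the alignment is done on the labels, the two units do not need to share the same set of categories.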

