I have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"group1": ["A", "A", "A", "B", "A", "B", "B", "B", "B", "B", "A", "A", "B"],
                   "group2": ["1", "1", "2", "1", "2", "2", "2", "1", "2", "1", "1", "1", "2"],
                   "date": ['2022-11-01', '2022-11-01', '2022-11-02', '2022-11-01', '2022-11-01',
                            '2022-11-01', '2022-11-02', '2022-11-02', '2022-11-01', '2022-11-01', '2022-11-02', '2022-11-02', '2022-11-02'],
                   "value": np.random.randint(10, high=50, size=13)})
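I want to compute the cumulative count, cumulative mean, and cumulative variance over "date", grouped by "group1" and "group2".
The following lines do the job, but I find them rather clumsy. Is there a better way?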
# sort
tmp = df.sort_values(["date", "group1", "group2"])
# cum mean
tmp2 = tmp.groupby(["group1", "group2"])["value"].expanding().mean().reset_index()
# cum var
tmp2["var"] = tmp.groupby(["group1", "group2"])["value"].expanding().var().values
# set old index in order to get the date from original df
tmp2 = tmp2.reset_index().set_index("level_2")
tmp2 = pd.concat([tmp["date"], tmp2], axis=1).drop(['index'], axis=1) # remove "index" col
# get the cum mean and cum var for each date
tmp2 = tmp2.groupby(["group1", "group2", "date"]).agg(cnt=("value", "count"), mean=("value", "last"), var=("var", "last")).reset_index()
# create cum count column
tmp2["cumcnt"] = tmp2.groupby(["group1", "group2"])["cnt"].cumsum()
# group by
tmp2.groupby(["group1", "group2", "date"]).last()
This returns the following DataFrame:
1 Answer
I use a multi-index pivot, because there would be too many NaN holes.
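For comparison, here is a minimal sketch of one way to get the same three columns without a pivot. It is not necessarily the approach meant above. The idea is to aggregate the count, sum, and sum of squares per (group1, group2, date), take running totals within each group, and derive the mean and sample variance from those totals. The names `keys`, `stats`, `cum`, and `result` are my own, and I swapped in a seeded generator for `np.random.randint` only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Same layout as the question's frame; the seeded generator is an assumption
# made for reproducibility (the question uses np.random.randint).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group1": ["A", "A", "A", "B", "A", "B", "B", "B", "B", "B", "A", "A", "B"],
    "group2": ["1", "1", "2", "1", "2", "2", "2", "1", "2", "1", "1", "1", "2"],
    "date": ["2022-11-01", "2022-11-01", "2022-11-02", "2022-11-01", "2022-11-01",
             "2022-11-01", "2022-11-02", "2022-11-02", "2022-11-01", "2022-11-01",
             "2022-11-02", "2022-11-02", "2022-11-02"],
    "value": rng.integers(10, 50, size=13),
})

keys = ["group1", "group2"]

# Per-(group, date) totals: n, sum(x), sum(x^2).
# ISO-formatted date strings sort chronologically, so sort_index is enough.
stats = (
    df.assign(sq=df["value"].astype(float) ** 2)
      .groupby(keys + ["date"])
      .agg(n=("value", "count"), s=("value", "sum"), ss=("sq", "sum"))
      .sort_index()
)

# Running totals within each (group1, group2), in date order.
cum = stats.groupby(level=keys).cumsum()

result = pd.DataFrame({
    "cnt": stats["n"],                # rows on that date
    "cumcnt": cum["n"],               # rows up to and including that date
    "mean": cum["s"] / cum["n"],      # cumulative mean
    # Sample variance (ddof=1), like expanding().var(); NaN while only one row seen.
    "var": (cum["ss"] - cum["s"] ** 2 / cum["n"]) / (cum["n"] - 1).where(cum["n"] > 1),
})
print(result)
```

The sum-of-squares formula can lose precision when values are large relative to their spread; for small integers like these it should agree with `expanding().var()` up to floating-point rounding.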