pandas 在一个数据框内查找一组总体的平均值/标准差和剩余总体的平均值/标准差

dgsult0t  于 2023-02-14  发布在  其他
关注(0)|答案(1)|浏览(158)

我有一个Pandas的数据框架,看起来像这样:

id  age  weight  group
1    12    45    [10-20]
1    18    110   [10-20]
1    25    25    [20-30]
1    29    85    [20-30]
1    32    49    [30-40]
1    31    70    [30-40]
1    37    39    [30-40]

我正在寻找一个 Dataframe ,看起来像这样:(sd=标准偏差)

group   group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
 [10-20]                       
 [20-30] 
 [30-40]

这里第二/三列是该组的平均值和SD。第三和第四列是其余组合并的平均值和SD。

thtygnil

thtygnil1#

下面是一个方法:

res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group==group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()

输出:

group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596

获得相同结果的另一种方法是:

res = ( pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,'weight'].mean(), 
            df.loc[df.group==x.group,'weight'].std(), 
            df.loc[df.group!=x.group,'weight'].mean(), 
            df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}) )

输出:

group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596
    • 更新**:OP在评论中询问:"如果我有一个以上的权重列怎么办?如果我有大约10个不同的权重列,并且我希望对所有权重列都使用sd怎么办?"

为了在下面说明,我创建了两个权重列(weightweight2),并简单地为每个权重列提供了所有4个聚合(平均值、标准差、其他值的平均值、其他值的标准差)。

wgtCols = ['weight','weight2']
res = ( pd.concat([ pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,wgtCol].mean(), 
            df.loc[df.group==x.group,wgtCol].std(), 
            df.loc[df.group!=x.group,wgtCol].mean(), 
            df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
    for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}) )

输入:

id  age  weight  weight2    group
0   1   12      45       55  [10-20]
1   1   18     110      120  [10-20]
2   1   25      25       35  [20-30]
3   1   29      85       95  [20-30]
4   1   32      49       59  [30-40]
5   1   31      70       80  [30-40]
6   1   37      39       49  [30-40]

输出:

group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight  group_mean_weight2  group_sd_weight2  rest_mean_weight2  rest_sd_weight2
0  [10-20]          77.500000        45.961941             53.60       24.016661           87.500000         45.961941              63.60        24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411           65.000000         42.426407              72.60        28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596           62.666667         15.821926              76.25        38.378596

相关问题