Dynamic sum of one column based on NA values in another column in Pandas

djmepvbi, posted 2022-12-02 in Other

I have an ordered DataFrame df. It is grouped by 'ID' and sorted by 'order':

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A','A', 'A','A', 'B','B', 'B','B', 'B', 'B', 'B','B'],
     'order': [1,3,4,6,7,9,11,12,13,14,15,16,19,25,8,10,15,17,20,25,29,31],
     'col1': [1,2,np.nan, 1,2,3,4,5, np.nan, np.nan,6,7,8,9,np.nan,np.nan,np.nan,10,11,12,np.nan,13],
     'col2': [1,5,6,np.nan,1,2,3,np.nan,2,3,np.nan,np.nan,3,1,5,np.nan,np.nan, np.nan,2,3, np.nan,np.nan],
     }
)

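Within each ID group, I need to sum col1 over the rows where col2 is NA. The sum also includes the col1 value of the row where the next col2 value exists.
I would prefer a vectorized solution for speed, but that may be difficult. It has to work inside a groupby, because col1_dynamic_sum should be computed per ID.
What I have done so far is define a function that counts the number of consecutive NAs in the rows just before the current row: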

def count_prev_consec_na(input_col):
    """
    This function takes a dataframe Series (column) and outputs the number of consecutive missing values in previous rows
    """    
    try:
        a1 = input_col.isna() + 0 ## missing
        a2 = ~input_col.isna() + 0  ## not missing
        b1 = a1.shift().fillna(0) ## prev missing
        d = a1.cumsum()
        e = b1*a2
        f = d*e
        g = f.replace(0, np.nan)
        h=g.ffill()
        h = h.fillna(0)
        i = h.shift()
        result = h-i
        result = result.fillna(0)

        return (result)

    except Exception as e:
        print(e)  # Python 3 exceptions have no .message attribute
        return None

I think one solution is to use this function to get the dynamic number of rows to roll back for the sum:

df['roll_back_count'] = df.groupby(['ID'], as_index = False).col2.transform(count_prev_consec_na)

ID  order   col1    col2    roll_back_count
A   1       1.0     1.0         0.0
A   3       2.0     5.0         0.0
A   4       NaN     6.0         0.0
A   6       1.0     NaN         0.0  
A   7       2.0     1.0         1.0    ## I want to sum col1 of order 6 and 7 and remove order 6 row
A   9       3.0     2.0         0.0
A   11      4.0     3.0         0.0
A   12      5.0     NaN         0.0
A   13      NaN     2.0         1.0   ## I want to sum col1 of order 12 and 13 and remove order 12 row
A   14      NaN     3.0         0.0
A   15      6.0     NaN         0.0
A   16      7.0     NaN         0.0
A   19      8.0     3.0         2.0   ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A   25      9.0     1.0         0.0
B   8       NaN     5.0         0.0
B   10      NaN     NaN         0.0
B   15      NaN     NaN         0.0
B   17      10.0    NaN         0.0
B   20      11.0    2.0         3.0   ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B   25      12.0    3.0         0.0
B   29      NaN     NaN         0.0
B   31      13.0    NaN         0.0
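
The roll-back count in the table above can also be sketched more compactly. The version below is an alternative under stated assumptions, not the function from the question: the helper name `count_prev_consec_na_compact` is hypothetical, and the sample values are group A's col2. It gives each run of NAs a block id, counts the run length with a cumulative sum, and reports that length on the first non-NA row after the run.

```python
import numpy as np
import pandas as pd

def count_prev_consec_na_compact(col):
    """Hypothetical helper: length of the NA run that ends just before each non-NA row."""
    na = col.isna()
    # Every present value starts a new block; the NAs that follow it share its block id.
    block = (~na).cumsum()
    # Running count of NAs inside each block.
    run_len = na.astype(int).groupby(block).cumsum()
    # Move the count one row down, then keep it only on rows where col is present.
    return run_len.shift(fill_value=0).where(~na, 0).astype(float)

# col2 of group A from the question
col2_a = pd.Series([1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1])
counts = count_prev_consec_na_compact(col2_a)
print(counts.tolist())
```

As with the original function, this would presumably be applied per ID through `df.groupby('ID').col2.transform(...)`.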

This is the output I want:

desired_output:

    ID  order   col1_dynamic_sum    col2
    A   1       1.0                 1
    A   3       2.0                 5
    A   4       NaN                 6
    A   7       3.0                 1
    A   9       3.0                 2
    A   11      4.0                 3
    A   13      5.0                 2
    A   14      NaN                 3
    A   19      21.0                3
    A   25      9.0                 1
    B   8       NaN                 5
    B   20      21.0                2
    B   25      12.0                3

Note: the sum should again ignore NAs. I would prefer a vectorized solution, but because of the rolling effect it may not be possible.

qpgpyjmq  #1

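I think I found a solution that needs no rolling at all! I created a new grouping ID from the NA pattern of col2: each row is labeled with the index of the next row (itself included) that has a col2 value. I can then aggregate on this grouping ID!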

def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0
    b = a.replace(0, np.nan)
    c = rollback_col.index
    d = c*b
    d = d.bfill()
    return(d)

df['na_group'] = df.groupby(['ID'], as_index = False).col2.transform(create_na_group)

df = df.loc[~df.na_group.isna()]
desired_output = df.groupby(['ID','na_group'], as_index=False).agg(
      order = ('order', 'last')
    , col1_dyn_sum = ('col1', sum)
    , col2 = ('col2', sum)
)

I just need to find a way to make sure NaN does not turn into 0, as it does in rows 2, 7 and 10 below.

    ID  na_group    order   col1_dyn_sum    col2
0   A       0.0     1       1.0             1.0
1   A       1.0     3       2.0             5.0
2   A       2.0     4       0.0             6.0
3   A       4.0     7       3.0             1.0
4   A       5.0     9       3.0             2.0
5   A       6.0     11      4.0             3.0
6   A       8.0     13      5.0             2.0
7   A       9.0     14      0.0             3.0
8   A       12.0    19      21.0            3.0
9   A       13.0    25      9.0             1.0
10  B       14.0    8       0.0             5.0
11  B       18.0    20      21.0            2.0
12  B       19.0    25      12.0            3.0
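
For the NaN-to-0 issue visible in the table above, one option worth trying is the `min_count` argument of pandas' `sum`: with `min_count=1`, a sum over only-NA values returns NaN instead of 0. The sketch below is self-contained and is not the answer author's exact code. Building the key with `where` plus a per-ID `bfill`, and taking `col2` with `'last'`, are assumptions about an equivalent formulation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A'] * 14 + ['B'] * 8,
    'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25,
              8, 10, 15, 17, 20, 25, 29, 31],
    'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9,
             np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
    'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1,
             5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
})

# Group key: index label of the next row (this one included) whose col2 is present.
# Backfilling is done per ID so a key never leaks from one ID into another.
marker = pd.Series(df.index, index=df.index, dtype='float64').where(df['col2'].notna())
df['na_group'] = marker.groupby(df['ID']).bfill()

# Trailing rows with no later col2 value inside their ID get no key and are dropped.
kept = df.dropna(subset=['na_group'])

out = kept.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    # min_count=1: a group with no non-NA col1 value sums to NaN, not 0.
    col1_dyn_sum=('col1', lambda s: s.sum(min_count=1)),
    col2=('col2', 'last'),
)
print(out)
```

With this variant, groups that contain at least one number are still summed with NAs ignored, and only the all-NA groups stay NaN.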

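I will create two separate sum columns, one with lambda x: x.sum(skipna=False) and one with lambda x: x.sum(skipna=True). If the skipna=True sum is 0 and the skipna=False sum is NA, I keep the final sum as NA; otherwise I use the skipna=True sum as the final output.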
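
The two-sum rule described above can be sketched on a small example. The toy frame, the column name `key`, and the helper names `sum_skipna` and `sum_strict` are hypothetical and only illustrate the rule; they are not taken from the answer.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): key 0 is all-NaN, key 1 has no NaN, key 2 mixes both.
toy = pd.DataFrame({
    'key':  [0, 1, 1, 2, 2],
    'col1': [np.nan, 1.0, 2.0, 5.0, np.nan],
})

def sum_skipna(x):
    return x.sum(skipna=True)    # NaN values are skipped; all-NaN gives 0

def sum_strict(x):
    return x.sum(skipna=False)   # any NaN makes the whole sum NaN

sums = toy.groupby('key').agg(
    sum_skipna=('col1', sum_skipna),
    sum_strict=('col1', sum_strict),
)

# Final rule: NaN only when the lenient sum is 0 and the strict sum is NaN.
all_missing = (sums['sum_skipna'] == 0) & sums['sum_strict'].isna()
sums['col1_dyn_sum'] = sums['sum_skipna'].mask(all_missing)
print(sums)
```

One caveat of this rule: a group whose non-missing values happen to sum to 0 while it also contains a NaN (for example 0 and NaN) would be marked NaN as well, since the two sums look the same as for an all-NaN group.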
