pandas: How to find the distance to the next non-NaN value in a numpy array

xggvc2p6  posted on 2023-09-29  in Other

Consider the following array:

import numpy as np

arr = np.array(
    [
        [10, np.nan],
        [20, np.nan],
        [np.nan, 50],
        [15, 20],
        [np.nan, 30],
        [np.nan, np.nan],
        [10, np.nan],
    ]
)

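For every cell in each column of arr, I need to find the distance to the next non-NaN value, i.e. how many rows further down the next non-NaN value sits. The expected result looks like this: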

expected = np.array(
    [
        [1, 2],
        [2, 1],
        [1, 1],
        [3, 1],
        [2, np.nan],
        [1, np.nan],
        [np.nan, np.nan]
    ]
)
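
To pin down the definition, here is a minimal reference sketch that walks each column from the bottom up with plain loops. The function name next_valid_distance_naive is hypothetical and not part of the question; it is only meant as a baseline to check faster solutions against, not as a recommended implementation.

```python
import numpy as np

def next_valid_distance_naive(arr):
    """Distance from each row to the next non-NaN row strictly below it, per column."""
    n_rows, n_cols = arr.shape
    out = np.full(arr.shape, np.nan)
    for col in range(n_cols):
        next_valid = None  # row index of the closest non-NaN seen so far (scanning bottom-up)
        for row in range(n_rows - 1, -1, -1):
            if next_valid is not None:
                out[row, col] = next_valid - row
            # The current row only counts as "next" for the rows above it.
            if not np.isnan(arr[row, col]):
                next_valid = row
    return out
```

Rows that have no non-NaN value below them keep the initial NaN, which matches the trailing NaN entries in the expected array.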
2w2cym1i  #1

With pandas, you can compute a reversed cumcount, combined with a mask and a shift:

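import pandas as pd

# Reverse the rows so that "next non-NaN" becomes "previous non-NaN".
# Per column: count rows since the last non-NaN (groupby + cumcount), blank rows
# that have no non-NaN yet (where + cummax), shift by one, then restore the order.
out = (pd.DataFrame(arr).notna()[::-1]
         .apply(lambda s: s.groupby(s.cumsum()).cumcount().add(1)
                           .where(s.cummax()).shift()[::-1])
         .to_numpy()
      )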

Output:

array([[ 1.,  2.],
       [ 2.,  1.],
       [ 1.,  1.],
       [ 3.,  1.],
       [ 2., nan],
       [ 1., nan],
       [nan, nan]])
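
The chained expression is dense, so the following sketch unrolls it for the first column only. The intermediate names (s, groups, steps, masked, result) are introduced here for illustration and do not appear in the answer; the calls are the same ones used in the chain above.

```python
import numpy as np
import pandas as pd

col = pd.Series([10, 20, np.nan, 15, np.nan, np.nan, 10])  # first column of arr

s = col.notna()[::-1]                        # non-NaN flags, read from the bottom up
groups = s.cumsum()                          # a new group starts at every non-NaN row
steps = s.groupby(groups).cumcount().add(1)  # 1, 2, 3, ... restarting at each non-NaN row
masked = steps.where(s.cummax())             # NaN until the first non-NaN has been seen
result = masked.shift()[::-1]                # shift by one in reversed order, then restore order

# result holds the first column of the expected array: 1, 2, 1, 3, 2, 1, nan
print(result.to_numpy())
```

The shift is what makes the distance strict: without it, every non-NaN row would measure the distance to itself rather than to the next value below it.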
hc2pp10m  #2

You may get some performance speedup by combining binary search with a few numpy functions:

box = []
for num in range(arr.shape[-1]):
    temp=arr[:, num]
    # this section gets the non-nan positions
    bools = ~np.isnan(temp)
    bools = bools.nonzero()[0]
    # this section gets positions of all indices 
    # with respect to the non-nan positions
    # side='right' picks the first non-nan position strictly after each index
    positions = np.arange(temp.size)
    bool_positions = bools.searchsorted(positions, side='right')
    # out of bound positions are replaced with nan
    filtered=bool_positions!=bools.size
    blanks=np.empty(temp.size, dtype=float)
    blanks[~filtered]=np.nan
    trimmed=bool_positions[filtered]
    indexer = positions[filtered]
    # subtract position of next non-nan from actual position
    blanks[indexer] = bools[trimmed] - indexer
    box.append(blanks)

np.column_stack(box)
array([[ 1.,  2.],
       [ 2.,  1.],
       [ 1.,  1.],
       [ 3.,  1.],
       [ 2., nan],
       [ 1., nan],
       [nan, nan]])
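
To see why side='right' matters, this small sketch runs the lookup on the first column alone and compares it with side='left'. The names valid_rows, rows, left, right, has_next and distance are illustrative and not taken from the answer.

```python
import numpy as np

valid_rows = np.array([0, 1, 3, 6])  # rows of the first column that hold a number
rows = np.arange(7)

# side='left' lands on the row itself when that row is non-NaN;
# side='right' skips past it to the next non-NaN row.
left = valid_rows.searchsorted(rows, side='left')    # [0 1 2 2 3 3 3]
right = valid_rows.searchsorted(rows, side='right')  # [1 2 2 3 3 3 4]

has_next = right < valid_rows.size   # False where no number follows (here: the last row)
distance = np.full(rows.size, np.nan)
distance[has_next] = valid_rows[right[has_next]] - rows[has_next]
```

With side='left', row 0 would map to itself and report a distance of 0, so side='right' is the variant that matches the expected result.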
