numpy: fully vectorized sum of binned data (using precomputed bin indices)

6ojccjat  posted on 2023-06-29  in  Other

Suppose I have a time series (t) with several observables (a and b):

import numpy as np

t = np.linspace(0, 10, 100)
a = np.random.normal(loc=5, scale=0.1, size=t.size)
b = np.random.normal(loc=1, scale=0.5, size=t.size)

I want to get the mean of each observable within each time bin, for example:

bin_edges = np.linspace(0, 12, 12)
bin_index = np.digitize(t, bin_edges) - 1  # 0-based bin number of each sample
a_binned = np.zeros(bin_edges.size - 1)
b_binned = np.zeros(bin_edges.size - 1)

# loop over the non-empty bins only; empty bins keep the value 0
for ibin in np.argwhere(np.bincount(bin_index) > 0).flatten():
    select = bin_index == ibin
    a_binned[ibin] = np.mean(a[select])
    b_binned[ibin] = np.mean(b[select])

My question: how can I vectorize this loop?

t40tm48m  #1

If you don't mind using pandas:

import pandas as pd
df = pd.DataFrame({'a':a,'b':b, 'index':bin_index})
gp = df.groupby('index').mean()

You can then extract the NumPy arrays like this:

a_binned = gp.a.values
b_binned = gp.b.values
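
One difference from the loop in the question is worth noting: `groupby` only returns rows for bins that actually contain samples, so the extracted arrays can be shorter than the number of bins. A minimal sketch of one way to line the result up with every bin number, assuming you know the total number of bins (`n_bins` and the toy data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: bins 1 and 3 are empty on purpose.
a = np.array([1.0, 3.0, 10.0, 20.0])
b = np.array([2.0, 4.0, 6.0, 8.0])
bin_index = np.array([0, 0, 2, 2])
n_bins = 4  # assumed total number of bins, e.g. bin_edges.size - 1

df = pd.DataFrame({'a': a, 'b': b, 'index': bin_index})
gp = df.groupby('index').mean()

# Reindex onto every bin number; bins without samples become NaN.
gp_full = gp.reindex(np.arange(n_bins))
a_binned = gp_full['a'].to_numpy()
b_binned = gp_full['b'].to_numpy()
```

Empty bins come out as NaN here rather than 0 as in the question's loop; `fillna(0)` on `gp_full` would restore the zeros if that is what you need.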

NumPy only:

Another solution, slightly more involved, uses only numpy and no loop. It assumes that your bin indices are sorted and are integer bin numbers (as in your case):

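# nonzero diffs mark the end (exclusive) of each run of equal bin numbers;
# prepend/append are chosen so that only the end of the array adds a boundary
indexes = np.argwhere(np.diff(bin_index, prepend=bin_index[0], append=bin_index[-1] + 1)).flatten()
len_of_index = np.diff(indexes, prepend=0)  # number of samples in each group
indexes = indexes - 1  # last array index of each group

cumsum = np.cumsum(a)[indexes]
sum_groups = np.diff(cumsum, prepend=0)
means = sum_groups / len_of_index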

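The idea is to take the cumulative sum and read it off at the last element of each group: the difference between consecutive values is the sum of one group, which is then divided by the number of samples in that group.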
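
To make the cumulative-sum idea concrete, here is the same computation traced on a tiny hand-made input. The data and the variable names (`ends`, `counts`, `cumsum_at_ends`) are chosen for this illustration only, and `bin_index` is assumed to be sorted:

```python
import numpy as np

# Hypothetical toy input: three sorted groups with sizes 2, 1 and 3.
bin_index = np.array([0, 0, 1, 2, 2, 2])
a = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])

# Positions where the bin number changes, plus the end of the array.
ends = np.flatnonzero(
    np.diff(bin_index, prepend=bin_index[0], append=bin_index[-1] + 1)
)                                          # [2, 3, 6]
counts = np.diff(ends, prepend=0)          # group sizes: [2, 1, 3]
cumsum_at_ends = np.cumsum(a)[ends - 1]    # running totals: [4, 9, 21]
sums = np.diff(cumsum_at_ends, prepend=0)  # per-group sums: [4, 5, 12]
means = sums / counts                      # [2, 5, 4]
```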
Finally, you can process several arrays at once and wrap everything in a convenient function:

array = np.array([a, b])

def binned_mean(array, bin_index):
    # nonzero diffs mark the end (exclusive) of each run of equal bin numbers
    indexes = np.argwhere(np.diff(bin_index, prepend=bin_index[0], append=bin_index[-1] + 1)).flatten()
    len_of_index = np.diff(indexes, prepend=0)  # number of samples in each group
    indexes = indexes - 1  # last array index of each group

    cumsum = np.cumsum(array, axis=1)[:, indexes]
    sum_groups = np.diff(cumsum, prepend=0)

    # taking care of possible holes in the binning
    means = np.full((array.shape[0], bin_index.max() + 1), np.nan)
    means[:, bin_index[indexes]] = sum_groups / len_of_index  # bin_index[indexes] -> bins that are present
    return means

binned_mean(array, bin_index)

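The numpy solution is much faster (roughly 100x), but the pandas version is still clearer and battle-tested. In particular, it does not rely on the assumptions I made for the numpy solution.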
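
If the sortedness assumption is a concern but you want to stay in numpy, a different technique from the cumulative-sum approach above is `np.bincount` with its `weights` argument, which sums values per bin regardless of the order of `bin_index`. This is only a sketch: it assumes every entry of `bin_index` is a non-negative integer smaller than `n_bins`, and the seeded generator is there just to make the example reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility
t = np.linspace(0, 10, 100)
a = rng.normal(loc=5, scale=0.1, size=t.size)
b = rng.normal(loc=1, scale=0.5, size=t.size)

bin_edges = np.linspace(0, 12, 12)
bin_index = np.digitize(t, bin_edges) - 1
n_bins = bin_edges.size - 1

# Samples per bin and per-bin sums; minlength pads trailing empty bins.
counts = np.bincount(bin_index, minlength=n_bins)
a_sum = np.bincount(bin_index, weights=a, minlength=n_bins)
b_sum = np.bincount(bin_index, weights=b, minlength=n_bins)

# Divide only where a bin has samples; empty bins are left as NaN.
nonempty = counts > 0
a_binned = np.full(n_bins, np.nan)
b_binned = np.full(n_bins, np.nan)
a_binned[nonempty] = a_sum[nonempty] / counts[nonempty]
b_binned[nonempty] = b_sum[nonempty] / counts[nonempty]
```

With these bin edges the last bin contains no samples, so its entry stays NaN. I have not timed this variant against the other two.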
