numpy: fully vectorise the sum of binned data (using a pre-computed bin index)

Posted by 6ojccjat on 2023-06-29 in Other


Suppose I have a time series (t) with several observables (a and b):

import numpy as np

t = np.linspace(0, 10, 100)
a = np.random.normal(loc=5, scale=0.1, size=t.size)
b = np.random.normal(loc=1, scale=0.5, size=t.size)

I want the mean of each observable within each time bin, for example:

bin_edges = np.linspace(0, 12, 12)
bin_index = np.digitize(t, bin_edges) - 1
a_binned = np.zeros(bin_edges.size - 1)
b_binned = np.zeros(bin_edges.size - 1)

for ibin in np.argwhere(np.bincount(bin_index) > 0).flatten():
    select = bin_index == ibin
    a_binned[ibin] = np.mean(a[select])
    b_binned[ibin] = np.mean(b[select])

My question: how can I vectorise the loop?

numpy

Source: https://stackoverflow.com/questions/76574134/fully-vectorise-sum-of-binned-data-using-pre-computed-bin-index

1 answer


t40tm48m1#

If you don't mind using pandas:

import pandas as pd
df = pd.DataFrame({'a':a,'b':b, 'index':bin_index})
gp = df.groupby('index').mean()

You can then extract the numpy arrays like this:

a_binned = gp.a.values
b_binned = gp.b.values
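Note that `groupby` returns rows only for bins that actually contain data, so the arrays above can be shorter than the ones produced by the loop in the question. Below is a minimal sketch of aligning them with the full bin range. It assumes that empty bins should be reported as NaN (the loop in the question leaves them at 0), and it uses a small made-up `bin_index` with a gap; `n_bins` and `gp_full` are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: bins 1 and 3 are empty, and there are 4 bins in total.
a = np.array([1.0, 3.0, 10.0, 20.0, 30.0])
b = np.array([2.0, 4.0, 1.0, 1.0, 1.0])
bin_index = np.array([0, 0, 2, 2, 2])
n_bins = 4

df = pd.DataFrame({'a': a, 'b': b, 'index': bin_index})
gp = df.groupby('index').mean()  # rows only for bins 0 and 2

# Reindex onto the full bin range; bins without data become NaN.
gp_full = gp.reindex(np.arange(n_bins))
a_binned = gp_full['a'].to_numpy()
b_binned = gp_full['b'].to_numpy()
```

With the question's data, `n_bins` would be `bin_edges.size - 1`.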

numpy only:

Another solution, slightly more involved, uses only numpy and no loop, but assumes that your bin indices are sorted and are integers (as in your case):

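# positions where the bin index changes; the appended sentinel differs from the
# last bin index, so the end of the last bin is always marked as well
indexes = np.argwhere(np.diff(bin_index, prepend=bin_index[0], append=bin_index[-1] + 1)).flatten()
len_of_index = np.diff(indexes, prepend=0)  # number of samples in each occupied bin
indexes = indexes - 1  # index of the last sample in each occupied bin

cumsum = np.cumsum(a)[indexes]  # cumulative sum at the end of each bin
sum_groups = np.diff(cumsum, prepend=0)  # per-bin sums
means = sum_groups / len_of_index  # one value per occupied bin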

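The idea is to take the cumulative sum and read it off at the last sample of each bin; the difference between consecutive values is then the sum within each bin, and dividing by the number of samples in the bin gives the mean.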
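To make the cumulative-sum idea concrete, here is a small worked sketch on made-up data. The toy arrays and the names `ends`, `counts`, `run_totals` and `sums` are introduced here for illustration, and the sketch relies on the same assumption as above, namely that `bin_index` is sorted.

```python
import numpy as np

# Hypothetical toy data; bin_index must be sorted for this trick to work.
a = np.array([1.0, 3.0, 10.0, 20.0, 30.0])
bin_index = np.array([0, 0, 2, 2, 2])

# Position of the last element of each run of equal bin indices.
ends = np.flatnonzero(np.diff(bin_index, append=bin_index[-1] + 1))
# ends -> [1, 4]

counts = np.diff(ends, prepend=-1)       # run lengths: [2, 3]
run_totals = np.cumsum(a)[ends]          # cumulative sum at each run end: [4, 64]
sums = np.diff(run_totals, prepend=0)    # per-bin sums: [4, 60]
means = sums / counts                    # per-bin means: [2, 20]
```

The result has one entry per occupied bin (here bins 0 and 2), not one per possible bin.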
Finally, you can also process several arrays at once and wrap everything in a convenient function:

array = np.array([a,b])
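def binned_mean(array, bin_index):
    # positions where the bin index changes; the appended sentinel differs from
    # the last bin index, so the end of the last bin is always marked as well
    indexes = np.argwhere(np.diff(bin_index, prepend=bin_index[0], append=bin_index[-1] + 1)).flatten()
    len_of_index = np.diff(indexes, prepend=0)  # number of samples in each occupied bin
    indexes = indexes - 1  # index of the last sample in each occupied bin

    cumsum = np.cumsum(array, axis=1)[:, indexes]
    sum_groups = np.diff(cumsum, prepend=0)  # per-bin sums for every row

    # taking care of possible holes in the binning: empty bins stay NaN
    means = np.full([np.shape(array)[0], bin_index.max() + 1], np.nan)
    means[:, bin_index[indexes]] = sum_groups / len_of_index  # bin_index[indexes] -> occupied bins
    return means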

binned_mean(array, bin_index)

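The numpy solution is much faster (roughly 100 times), but pandas is still clearer and battle-tested; in particular, it does not rely on the assumptions I made for the numpy solution.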
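If the sortedness assumption is a concern, a different technique that is not part of the answer above is `np.bincount` with its `weights` argument, which sums values per integer label regardless of order. The sketch below is one possible way to use it, not a definitive implementation: `binned_mean_bincount` and its `n_bins` parameter are hypothetical names, empty bins are assumed to be reported as NaN, and the list comprehension iterates over the observables (rows), not over bins or samples.

```python
import numpy as np

def binned_mean_bincount(array, bin_index, n_bins=None):
    """Hypothetical helper: per-bin mean of each row of `array` via np.bincount."""
    array = np.atleast_2d(array)
    if n_bins is None:
        n_bins = int(bin_index.max()) + 1
    # number of samples falling into each bin
    counts = np.bincount(bin_index, minlength=n_bins)
    # per-bin sums, one row per observable
    sums = np.stack([np.bincount(bin_index, weights=row, minlength=n_bins)
                     for row in array])
    # divide only where a bin has samples; empty bins stay NaN
    means = np.full(sums.shape, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means

# Same setup as in the question, with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 100)
a = rng.normal(loc=5, scale=0.1, size=t.size)
b = rng.normal(loc=1, scale=0.5, size=t.size)
bin_edges = np.linspace(0, 12, 12)
bin_index = np.digitize(t, bin_edges) - 1

means = binned_mean_bincount(np.array([a, b]), bin_index, n_bins=bin_edges.size - 1)
```

Passing `n_bins` explicitly keeps the output aligned with `bin_edges` even when the trailing bins are empty, as is the case for the last bin here.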

Answered 2023-06-29
