numpy: compute statistics in vectorized form for each data subset selected over multiple grouping features

Asked by ioekq8ef, 11 months ago

Consider the following code, which performs the operation in non-vectorized form:

import numpy as np

N = 110
observables = np.random.rand(N)

I = np.random.randint(np.array([0,0])[:,None], np.array([4,5])[:,None], 
                     size=(2,N))

# you can think of this as two groupings of the data with numbers
# indicating separate groups (one has 4 groups the other 5). I want 
# to compute then all the averages for each i,j where i is the index 
# from the first grouping and j the second

averages = np.zeros((4,5))

#unvectorized
for i in range(4):
    for j in range(5):
        J = np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten()
        averages[i,j] = np.mean(observables[J])

I came up with the following vectorization, but it is very inefficient for large N, because even for N=1000 the least common multiple grows to an unmanageable size:

import math

#vectorized inefficiently
J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]
lengths = [len(j) for j in J]

L = math.lcm(*lengths)

J = np.array([np.tile(j,int(L/len(j))) for j in J])

averages_vectorized = np.reshape(np.mean(observables[J], axis=1), (4,5))

Is there another way to vectorize this? For example, can an index list like [0,1,2] be extended to something like [0,1,2,sth,sth] so that the sth entries are not taken into account when I use the list to index into a numpy vector?
PS: There is also the following approach, which just hides the for loops inside two list comprehensions:

J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]

averages_list_comprehension = np.reshape(np.array([np.mean(observables[j]) for j in J]), (4,5))

PS2: One approach would be to append a nan value at the end of observables, extend each element of J with the index N until all of them have the same size, and then use np.nanmean. However, my end goal is to apply this inside PyTensor's tensor environment, so I am not sure how to do the same thing there (in that case observables is a tensor of scalars, and I don't know whether PyTensor has nan tensors or a nanmean).
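A minimal numpy sketch of this nan-padding idea, reusing observables, I and N from above (the names observables_padded, J_padded and averages_nanpad are just for illustration):

# append one nan so that index N points at a value nanmean will ignore
observables_padded = np.concatenate([observables, [np.nan]])

J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]

# pad every index list with N up to the longest one instead of up to the lcm
max_len = max(len(j) for j in J)
J_padded = np.array([np.pad(j, (0, max_len - len(j)), constant_values=N)
                     for j in J])

averages_nanpad = np.reshape(np.nanmean(observables_padded[J_padded], axis=1), (4,5))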

Answer 1 (j0pj023g):

Update: represent the data in an (n+1)-dimensional array

If the number of data points and groups is not large, then in the case of 2 groupings we can represent the data as a 3D array, where the first axis is the index of the value and the last two axes correspond to the groups of each grouping feature. For example, if the first value x0 belongs to group 1 and group 2, then the array value arr[0, 1, 2] equals x0, while all other values arr[0, ...] are nan. If the next value x1 belongs to group 2 and group 0, then arr[1, 2, 0] equals x1 while the other arr[1, ...] == nan, and so on.

import numpy as np
from numpy.random import default_rng
rng = default_rng(0)

# parameters
N = 110
group_counts = (4, 5)

# init
values = rng.random(N)
groups = rng.integers(0, group_counts, size=(N, len(group_counts)))

# transform values into (n+1)-dimensional array, where n is number of grouping features
arr = np.full((N, *group_counts), np.nan, dtype=float)
arr[range(N), *groups.T] = values

# calculate mean values along axis 0 ignoring nan values
answer = np.nanmean(arr, axis=0)

In the case of large data, we can use sparse arrays:

from scipy import sparse

... # initialize  values and groups as previously

# mark each combination of grouping features by unique ID numbers
features = np.arange(np.prod(group_counts)).reshape(group_counts)

# create a compressed sparse column matrix
# whose first index runs over the values
# and whose second index is the ID of the feature
# combination of the corresponding value
arr = sparse.csc_array(
    (values, (np.arange(N), features[*groups.T])),
    shape=(N, np.prod(group_counts))
)

But in this case we must be careful, because missing values are replaced with 0 and are taken into account when computing statistics:

answer1 = arr.mean(axis=0).reshape(group_counts)
assert not np.allclose(answer, answer1)    # we obtain a different answer in this case

To overcome this, we have to reimplement the statistic of interest. In the case of the mean we can try:

value_counts = np.diff(arr.indptr)
answer2 = (arr.sum(0) / value_counts).reshape(group_counts)
assert np.allclose(answer, answer2)

Here np.diff(arr.indptr) is equivalent to [x.nnz for x in arr.T] and returns the number of stored values in each column (note that zeros placed explicitly in the array are counted too). See the documentation on nnz and indptr for details.
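As a small sanity check (assuming the values, groups, features and arr defined above), the counts obtained from indptr can be compared against a direct count of how many values fall into each feature combination:

# per-column counts read off the CSC column pointer
counts_from_indptr = np.diff(arr.indptr)

# direct count of values per feature-combination ID
column_ids = features[tuple(groups.T)]
counts_direct = np.bincount(column_ids, minlength=np.prod(group_counts))

assert np.array_equal(counts_from_indptr, counts_direct)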
Alternatively, we can try n-dimensional sparse arrays (the sparse package):

import sparse

arr = sparse.COO([range(N), *groups.T], values, shape=(N, *group_counts), fill_value=np.nan)
answer = sparse.nanmean(arr, axis=0).todense()

Previous answer: split sorted group combinations

Sort by group combination, then split:

def mean_by_groups(values, groups):
    # ensure it can work with structures other than numpy.ndarray
    values = np.asarray(values)
    groups = np.atleast_2d(np.asarray(groups))
    
    # ensure we have group mappings along the last value axis 
    assert values.shape[-1] == groups.shape[-1], 'Inappropriate shape of group mappings'
    assert (groups >= 0).all(), 'group numbering should start from 0'
    
    length = values.shape[-1]
    groups_count = 1 + groups.max(-1)
    # by default, lexical sorting is performed on the last axis
    sorted_by_group = np.lexsort(groups)
    # determine where combinations of group mappings are changing
    # here prepend=-1 to include the first combination (group number != -1)
    diff_from_prev = np.diff(groups[:, sorted_by_group], prepend=-1).any(0)
    bins = np.arange(length)[diff_from_prev]
    coordinates = np.transpose(groups[:, sorted_by_group][:, bins])
    # here we use bins[1:] to avoid splitting in front of the first value
    value_groups = np.split(values[sorted_by_group], bins[1:])
    ans = np.full(shape=groups_count, fill_value=np.nan)
    for group_combination, arr in zip(coordinates, value_groups):
        # here we use tuple to address a single cell
        ans[tuple(group_combination)] = arr.mean()
        
    return ans

answer = mean_by_groups(observables, I)

With this approach we can have more than two group mappings. Note also the use of numpy.full instead of numpy.zeros: group combinations that never occur in the mapping should remain nan rather than 0.
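For example, a hypothetical call with three grouping features (sizes chosen only for illustration) could look like this:

# three grouping features with 3, 4 and 5 groups respectively
rng = np.random.default_rng(1)
M = 200
values3 = rng.random(M)
groups3 = rng.integers(0, [3, 4, 5], size=(M, 3)).T   # shape (3, M)

means3 = mean_by_groups(values3, groups3)
# means3.shape == (3, 4, 5) provided every group index occurs at least once;
# combinations that never occur stay nan instead of 0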
