Group and sum a numpy array by matching column positions against another array

kupeojn6 · asked 2023-05-07

What my data looks like

I run some numerical simulations with Numba.
The output is a set of numpy arrays; each array represents one metric, and each array has shape (periods x items).
For example, metric_1[p,i] gives the value of metric_1 at time p for item i.
Each item belongs to a category - say red and green, just as an example. A 1-D array categories tells me exactly that - e.g. categories[0]='a' means the first item belongs to category a. Conceptually, it's like "flattening" a pandas MultiIndex into a separate array.
What I want to do
1. I want to group by category and create arrays metric_1_grouped etc., each of shape (periods x categories).
2. I want to create one dataframe per category, plus one dataframe with the sum over all categories, where each row is a period and each column is a metric.
The task itself is fairly mundane, but my question is: **since I have to do this many times, what is a good way to do it as efficiently as possible?** A typical case would be:

  • 300 periods
  • 12 metrics
  • 500,000 items
  • 6 categories

Why I don't think this question is a duplicate

I know there are several questions asking whether there is an equivalent of groupby in numpy, e.g. Is there any numpy group by function?, but they are different because they all group by elements of the array itself. That's not what I want - I need to group, yes, but not by any element of the array itself, rather by matching column numbers against another array.
There are some questions about summing based on positions, but if I understand them correctly they are not like my case, e.g. Map numpy array and sum values on positions in another array, sum array with condition in another array with numpy.

  • Please, please, pretty please, don't close this question unless you are sure it has already been answered elsewhere - there are many questions that sound similar but are not. Thank you all.

Possible solutions?

  • A pandas DataFrame MultiIndex - but I fear it might be much slower
  • itertools.groupby? I admit I'm not very familiar with it

What I have tried - it works, but it's inelegant and a bit clunky
My code below works, but it is inelegant and somewhat clunky. I'm hoping there is a better/more elegant/faster version?

import numpy as np
import pandas as pd

num_periods = 300
num_items = 1000
# Let's suppose for simplicity that the data has already been sorted by category
categories = np.empty(num_items, dtype=object)
categories[0:100]='a'
categories[100:300]='b'
categories[300:600]='c'
categories[600:]='d'

rng = np.random.default_rng(seed=42) #setting a seed for reproducibility
metric_1 = rng.normal(0,1,(num_periods,num_items))
metric_2 = rng.uniform(0,1,(num_periods,num_items)) 
unique_categories = np.unique(categories)
num_categories=len(unique_categories)

where_to_split  = np.unique(categories, return_index=True)[1][1:]
# The second item of the tuple returned by np.unique is an array with the
# first indices of the categories (which, remember, we had already sorted -
# this is a requirement), so it will be: [0, 100, 300, 600]
# so where_to_split is the array [100, 300, 600]

metric_1_list = np.split(metric_1, where_to_split, axis=1)
metric_1_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_1_list)):
    metric_1_by_category[:,i] = metric_1_list[i].sum(axis=1)
    
metric_2_list = np.split(metric_2, where_to_split, axis=1)
metric_2_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_2_list)):
    metric_2_by_category[:,i] = metric_2_list[i].sum(axis=1)
    
# we now create a dictionary of dataframes
# df_by_cat['a'] will be the dataframe for category a, etc
df_by_cat = {}
for my_count, my_val in enumerate(unique_categories):
    df_by_cat[my_val] = pd.DataFrame(index = np.arange(0,num_periods), columns=['metric 1','metric 2'])
    df_by_cat[my_val]['metric 1'] = metric_1_by_category[:,my_count]
    df_by_cat[my_val]['metric 2'] = metric_2_by_category[:,my_count]

rggaifut1#

Maybe you can use:

from itertools import zip_longest

dfs = {}
cats, idx = np.unique(categories, return_index=True)
for cat, i, j in zip_longest(cats, idx, idx[1:], fillvalue=num_items):
    data = {'metric 1': metric_1[..., i:j].sum(axis=1),
            'metric 2': metric_2[..., i:j].sum(axis=1)}
    dfs[cat] = pd.DataFrame(data)

Output:

>>> dfs['b']
      metric 1    metric 2
0    -7.296710  104.931030
1    -9.827326  105.882775
2     5.355780  100.459972
3     2.047677  102.027632
4    19.013256  100.234287
..         ...         ...
295  -0.917488  106.399000
296 -10.455586  103.749010
297  -4.187631  103.213497
298   3.307818   99.927642
299   9.137808   99.497051

[300 rows x 2 columns]

You can concatenate all the dataframes with pd.concat:

>>> pd.concat(dfs, axis=1)
             a                     b                      c                      d            
      metric 1   metric 2   metric 1    metric 2   metric 1    metric 2   metric 1    metric 2
0    -5.026961  54.575700  -7.296710  104.931030  -3.375551  149.866571 -13.192328  210.539154
1     5.278172  48.128531  -9.827326  105.882775 -22.063953  155.756189 -54.765248  196.552705
2    -3.108761  52.650639   5.355780  100.459972  22.150701  150.443303   9.303484  199.897130
3     5.986694  49.389781   2.047677  102.027632  12.479241  155.644543 -21.972627  201.381321
4     0.229614  49.621222  19.013256  100.234287  -4.025634  147.322897 -36.574717  204.282579
..         ...        ...        ...         ...        ...         ...        ...         ...
295   8.319109  53.144215  -0.917488  106.399000  27.867494  147.736396   4.127136  207.730258
296  14.944851  48.941017 -10.455586  103.749010  11.772255  141.037684 -20.975950  193.678342
297 -13.453457  50.234486  -4.187631  103.213497 -21.091565  157.424556  15.851561  203.338274
298 -13.933062  48.845265   3.307818   99.927642   7.864996  152.657109 -31.559198  200.892997
299  -8.041795  43.714031   9.137808   99.497051  -4.100661  147.830122  11.852521  197.667950

[300 rows x 8 columns]

And play with .T, .stack, pivot, etc. to reshape your DataFrame.
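For instance, a minimal standalone sketch (toy data, not the arrays above) of one such reshape - stacking the outer column level so categories become index rows:

```python
import numpy as np
import pandas as pd

# toy stand-ins for the per-category dataframes built earlier
rng = np.random.default_rng(0)
dfs = {cat: pd.DataFrame({'metric 1': rng.normal(size=3),
                          'metric 2': rng.uniform(size=3)})
       for cat in ['a', 'b']}

wide = pd.concat(dfs, axis=1)   # columns: (category, metric) MultiIndex
long = wide.stack(level=0)      # index: (period, category); columns: metrics
```

Here `wide` has shape (3 periods, 2 categories x 2 metrics) and `long` has shape (6, 2), one row per (period, category) pair.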


kr98yfug2#

I think pandas' groupby handles unbalanced categories better than np.split. You can do the groupby for each metric separately in a for loop, since you have relatively few metrics; in that case you don't really need a MultiIndex. Alternatively, you can concatenate all your data into one multi-indexed dataframe and do a single groupby, ending up with one central dataframe instead of a list/dict of them.
Let's try the second approach:

## begin sample data
num_periods = 300
num_items = 1000
# Let's suppose for simplicity that the data has already been sorted by category
categories = np.empty(num_items, dtype=object)
categories[0:100]='a'
categories[100:300]='b'
categories[300:600]='c'
categories[600:]='d'

rng = np.random.default_rng(seed=42) #setting a seed for reproducibility
metric_1 = rng.normal(0,1,(num_periods,num_items))
metric_2 = rng.uniform(0,1,(num_periods,num_items)) 

### end sample data

### the metric dict:
metrics = {'metric1': metric_1, 'metric2': metric_2}

out = pd.concat({
    k: pd.DataFrame(v.T).assign(cat=categories) for k,v in metrics.items()
}).set_index('cat', append=True).groupby(level=[0,-1]).sum()

Then the first 5 periods look like this:

0           1           2           3           4
        cat                                                            
metric1 a     -5.026961    5.278172   -3.108761    5.986694    0.229614
        b     -7.296710   -9.827326    5.355780    2.047677   19.013256
        c     -3.375551  -22.063953   22.150701   12.479241   -4.025634
        d    -13.192328  -54.765248    9.303484  -21.972627  -36.574717
metric2 a     54.575700   48.128531   52.650639   49.389781   49.621222
        b    104.931030  105.882775  100.459972  102.027632  100.234287
        c    149.866571  155.756189  150.443303  155.644543  147.322897
        d    210.539154  196.552705  199.897130  201.381321  204.282579

Now, say you want the data for metric_1, you can do:

out.loc['metric1']

and you get (first 5 periods):

0          1          2          3          4
cat                                                       
a    -5.026961   5.278172  -3.108761   5.986694   0.229614
b    -7.296710  -9.827326   5.355780   2.047677  19.013256
c    -3.375551 -22.063953  22.150701  12.479241  -4.025634
d   -13.192328 -54.765248   9.303484 -21.972627 -36.574717

Update: since you have relatively many items and few metrics, you can group each metric separately, e.g.:

out = pd.concat({
    k: pd.DataFrame(v).groupby(categories, axis=1).sum()
    for k,v in metrics.items()
})

Then out looks like:

a           b           c           d
metric1 0    -5.026961   -7.296710   -3.375551  -13.192328
        1     5.278172   -9.827326  -22.063953  -54.765248
        2    -3.108761    5.355780   22.150701    9.303484
        3     5.986694    2.047677   12.479241  -21.972627
        4     0.229614   19.013256   -4.025634  -36.574717
...                ...         ...         ...         ...
metric2 295  53.144215  106.399000  147.736396  207.730258
        296  48.941017  103.749010  141.037684  193.678342
        297  50.234486  103.213497  157.424556  203.338274
        298  48.845265   99.927642  152.657109  200.892997
        299  43.714031   99.497051  147.830122  197.667950

and the data for a specific metric is:

out.loc['metric1']

Output:

a          b          c          d
0    -5.026961  -7.296710  -3.375551 -13.192328
1     5.278172  -9.827326 -22.063953 -54.765248
2    -3.108761   5.355780  22.150701   9.303484
3     5.986694   2.047677  12.479241 -21.972627
4     0.229614  19.013256  -4.025634 -36.574717
..         ...        ...        ...        ...
295   8.319109  -0.917488  27.867494   4.127136
296  14.944851 -10.455586  11.772255 -20.975950
297 -13.453457  -4.187631 -21.091565  15.851561
298 -13.933062   3.307818   7.864996 -31.559198
299  -8.041795   9.137808  -4.100661  11.852521
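Incidentally, the question also asked for one dataframe with the sum over all categories (rows = periods, columns = metrics). A minimal sketch of getting that from a frame shaped like `out`, using standalone toy data and a transpose-groupby-transpose (equivalent to grouping columns, but without the axis=1 keyword):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
metrics = {'metric1': rng.normal(size=(3, 8)),    # periods x items
           'metric2': rng.uniform(size=(3, 8))}
categories = np.array(['a'] * 3 + ['b'] * 5)      # one label per item

# same shape as `out` above: index (metric, period), columns = categories
out = pd.concat({k: pd.DataFrame(v).T.groupby(categories).sum().T
                 for k, v in metrics.items()})

# sum over categories, then pivot metrics into columns:
# rows = periods, columns = metrics
totals = out.sum(axis=1).unstack(level=0)
```

By construction `totals['metric1']` equals `metrics['metric1'].sum(axis=1)`, i.e. the per-period sum over all items.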

Update 2: the numpy groupby functionality in your case is np.add.reduceat:

# notice the missing [1:]
where_to_split  = np.unique(categories, return_index=True)[1]

out = {k: np.add.reduceat(v, where_to_split, axis=1)
          for k, v in metrics.items()
      }
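As a quick sanity check with toy data (assuming, as throughout, that categories is sorted), np.add.reduceat matches the split-and-sum loop from the question:

```python
import numpy as np

rng = np.random.default_rng(42)
metric = rng.normal(0, 1, (4, 10))                 # periods x items
categories = np.array(['a'] * 3 + ['b'] * 3 + ['c'] * 4)

starts = np.unique(categories, return_index=True)[1]  # [0, 3, 6]
fast = np.add.reduceat(metric, starts, axis=1)        # periods x categories

# reference: split at the category boundaries and sum each chunk
slow = np.column_stack([chunk.sum(axis=1)
                        for chunk in np.split(metric, starts[1:], axis=1)])
print(np.allclose(fast, slow))  # True
```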
