NumPy: interleaving group IDs, where the IDs come from 2 separate arrays that define group sizes

eagi6jfj  posted on 2022-12-13 in Other

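1. The question
I have two different functions, named update() and reset(). These functions are what I call "IDs" in the title. They are applied in turn to consecutive groups of rows in the data array.
The sizes of the groups to which each function applies are defined in two dedicated arrays, one per function.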

import numpy as np

# 'data' array, size 5.
data = np.array([1., 4., 3., 8., 9.], dtype='float64')

# Group sizes to which will apply 'update' function.
update_gs = np.array([1, 0, 2, 2], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'update' function is to be performed successively on:
# - the 1st row
# - then without any row from data
# - then with the 2nd and 3rd rows from data
# - then with the 4th and 5th rows from data

# Group sizes to which will apply 'reset' function.
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'reset' function is to be performed successively on:
# - the first 2 rows
# - a 2nd and 3rd reset will be run without any row from data
# - a 4th reset will be run with 3rd and 4th rows from data
# - a 5th reset will be run with the last row from data
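
To make the group definitions concrete, here is a minimal sketch that turns group sizes into half-open row ranges with a cumulative sum. The helper name group_bounds is an assumption introduced for illustration; it is not part of the original code.

```python
import numpy as np

def group_bounds(group_sizes):
    """Hypothetical helper: half-open [start, end) row range of each consecutive group."""
    ends = np.cumsum(group_sizes)   # index just past the last row of each group
    starts = ends - group_sizes     # a group of size 0 gets start == end
    return starts, ends

data = np.array([1., 4., 3., 8., 9.], dtype='float64')
update_gs = np.array([1, 0, 2, 2], dtype='int64')
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')

update_starts, update_ends = group_bounds(update_gs)
reset_starts, reset_ends = group_bounds(reset_gs)

# Rows covered by each call; a group of size 0 yields an empty slice.
update_rows = [data[s:e] for s, e in zip(update_starts, update_ends)]
reset_rows = [data[s:e] for s, e in zip(reset_starts, reset_ends)]
```

With the input above, the second update and the second and third reset each receive an empty slice, which matches the comments in the code.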

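The result I am looking for from this input data is two 1D arrays:

  • Each row of these result arrays relates to one occurrence of update or reset.
  • Hence the size of these arrays is len(update_gs) + len(reset_gs), i.e. 9 here.
  • One array is int and again defines group sizes. In this result array, a group size is the number of rows "elapsed" since the previous occurrence of update or reset.
  • The other array is bool and tells whether a row relates to the reset function (value True) or to the update function (value False).
  • Regarding the order in which update and reset occur:
      • The row groups in data for update and reset occurrences overlap. Comparing the last row of each row group, the occurrence (update or reset) whose last row comes later in data also comes later in the result arrays.
      • If an update row group and a reset row group share the same last row, update comes first in the result arrays.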

With the previous data, the expected result is:

group_sizes = np.array([1, # 1st action is 'update'. It applies to 1st row in data.
                        0, # There is a 2nd 'update' on 0 row.
                        1, # At 2nd row of data, there is the 1st 'reset' action.
                        0, # There is a 2nd 'reset' on 0 row.
                        0, # There is a 3rd 'reset' on 0 row.
                        1, # There is the 3rd 'update' taking place, with 1 row elapsed since previous function.
                        1, # There is a 4th 'reset', with 1 row elapsed since previous function.
                        1, # There is the 4th 'update' taking place, with 1 row elapsed since previous function.
                        0, # Finally the last 'reset' occurs, with the same ending row in 'data' as the previous 'update'.
                       ], dtype='int64')
# Sum of the values gives 5, the total number of rows in 'data'.                        

function_ids = np.array([False, # 1st action is 'update'.
                         False, # There is a 2nd 'update'.
                         True,  # There is the 1st 'reset' action.
                         True,  # There is a 2nd 'reset'.
                         True,  # There is a 3rd 'reset'.
                         False, # There is the 3rd 'update'.
                         True,  # There is a 4th 'reset'.
                         False, # There is the 4th 'update'.
                         True,  # Occurs finally the last 'reset'.
                        ], dtype='bool')
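
The expected arrays can be checked against a few invariants implied by the description above. This is only a sanity-check sketch; the dictionary name checks and the variable n_rows are assumptions introduced for illustration.

```python
import numpy as np

update_gs = np.array([1, 0, 2, 2], dtype='int64')
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
group_sizes = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0], dtype='int64')
function_ids = np.array([False, False, True, True, True,
                         False, True, False, True], dtype='bool')
n_rows = 5  # number of rows in 'data'

# Index just past the last row reached at each entry of the result arrays.
ends = np.cumsum(group_sizes)

checks = {
    "one entry per call": len(group_sizes) == len(update_gs) + len(reset_gs),
    "all rows covered": group_sizes.sum() == n_rows,
    "number of resets": function_ids.sum() == len(reset_gs),
    "number of updates": (~function_ids).sum() == len(update_gs),
    # Each 'update' entry must end where the matching 'update' group ends.
    "update end rows": np.array_equal(ends[~function_ids], np.cumsum(update_gs)),
    # Same for 'reset'.
    "reset end rows": np.array_equal(ends[function_ids], np.cumsum(reset_gs)),
}
all_ok = all(checks.values())
```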

2. Possibly an XY problem?
The question above comes from the following considerations:

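  • I have the two arrays reset_gs and update_gs mentioned above as inputs. How the related functions (update and reset) work depends on which function was applied to the previous group (reset or update?) and on its result.
  • For this reason, I first tried interleaving the respective calls in 2 for loops. This led to complicated code that I have not managed to get working. I believe that, given time, it could be made to work with several ifs, buffer variables and boolean flags passing the previous state between the 2 interleaved for loops. This really does not seem adequate.
  • For this reason, I believe a single flattened for loop is the better choice. The 2 result arrays I am looking for (question above) would allow me to go for such a solution.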
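
As an illustration of the flattened loop, here is a minimal sketch that walks the two result arrays once and dispatches to update or reset. The bodies of update and reset below are hypothetical placeholders that only record what they were called with; the real functions and the state they carry are not shown in the question.

```python
import numpy as np

data = np.array([1., 4., 3., 8., 9.], dtype='float64')
group_sizes = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0], dtype='int64')
function_ids = np.array([False, False, True, True, True,
                         False, True, False, True], dtype='bool')

def update(log, new_rows):
    # Hypothetical placeholder: record the call instead of doing real work.
    log.append(("update", new_rows.tolist()))
    return log

def reset(log, new_rows):
    # Hypothetical placeholder: record the call instead of doing real work.
    log.append(("reset", new_rows.tolist()))
    return log

log = []     # stands in for whatever state is passed from one call to the next
row_end = 0
for size, is_reset in zip(group_sizes.tolist(), function_ids.tolist()):
    # Rows elapsed since the previous call, whichever function that was.
    row_start, row_end = row_end, row_end + size
    new_rows = data[row_start:row_end]
    log = reset(log, new_rows) if is_reset else update(log, new_rows)
```

The loop runs len(group_sizes) times rather than once per row of data, and the previous state is passed along in a single variable instead of being shared between two interleaved loops.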

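3. What about looping over data?
data has several million rows. update_gs has a few thousand rows. reset_gs has from a few hundred to a few thousand rows.
Performance-wise, I have reason to believe that looping over update_gs and reset_gs (i.e. over group definitions, thousands of iterations) rather than over data (each row individually, millions of iterations) will make the code faster.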

nfs0ujit  #1

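This actually boils down to "how do I do a merge sort?", or more precisely the merge step applied to two arrays that are already sorted. I found a way to do it with the sortednp package, which seems to be the fastest option.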

import numpy as np
from sortednp import merge # merge of sorted numpy arrays

# Input data.
update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')

# Switching from group sizes to indices of group last row, in-place.
np.cumsum(update, out=update)
np.cumsum(reset, out=reset)
# Performing a merge of sorted arrays, keeping insertion index for 'update'.
merged_idx, (update_idx, _) = merge(update, reset, indices=True)
# Going back to group sizes.
merged_gs = np.diff(merged_idx, prepend=0)
# Final step to get 'function_ids'.
function_ids = np.ones(len(merged_gs), dtype="bool")
function_ids[update_idx] = False

# Here we are.
merged_gs
# -> array([1, 0, 1, 0, 0, 1, 1, 1, 0])
function_ids
# -> array([False, False,  True,  True,  True, False,  True, False,  True])
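
If sortednp is not available, the same result can be sketched with NumPy alone by concatenating the cumulative sums and applying a stable argsort. This is a substitute technique, not the method used in the answer above, and the function name interleave_groups is an assumption. It relies on a full sort, O(n log n), rather than a linear merge, so it is likely slower on large inputs.

```python
import numpy as np

def interleave_groups(update_gs, reset_gs):
    """Hypothetical helper: merged group sizes and a bool mask (True = reset)."""
    # Index just past the last row of each group.
    update_ends = np.cumsum(update_gs)
    reset_ends = np.cumsum(reset_gs)
    # 'update' entries are placed first, so a stable sort keeps them ahead
    # of 'reset' entries that share the same last row.
    ends = np.concatenate([update_ends, reset_ends])
    is_reset = np.concatenate([np.zeros(len(update_gs), dtype=bool),
                               np.ones(len(reset_gs), dtype=bool)])
    order = np.argsort(ends, kind="stable")
    # Back from end-row indices to the number of rows elapsed between calls.
    group_sizes = np.diff(ends[order], prepend=0)
    return group_sizes, is_reset[order]

update_gs = np.array([1, 0, 2, 2], dtype='int64')
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
group_sizes, function_ids = interleave_groups(update_gs, reset_gs)
```

Unlike the in-place cumsum above, this version leaves update_gs and reset_gs unmodified.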
