numpy: interleaving group IDs, with the IDs coming from 2 separate arrays that define group sizes

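The question
I have two different functions, named update() and reset(). These functions are what I call "IDs" in the title. They are applied in turn to successive groups of rows in the data array.
The sizes of the groups to which these functions are applied are defined in related arrays.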

import numpy as np

# 'data' array, size 5.
data = np.array([1., 4., 3., 8., 9.], dtype='float64')

# Group sizes to which the 'update' function will apply.
update_gs = np.array([1, 0, 2, 2], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'update' function is to be performed successively on:
# - the 1st row
# - then without any row from data
# - then with the 2nd and 3rd rows from data
# - then with the 4th and 5th rows from data

# Group sizes to which the 'reset' function will apply.
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'reset' function is to be performed successively on:
# - the first 2 rows
# - a 2nd and 3rd reset will be run without any row from data
# - a 4th reset will be run with 3rd and 4th rows from data
# - a 5th reset will be run with the last row from data

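The results I am looking for from this input data are 2 1D arrays:

Each row of these result arrays relates to one occurrence of update or reset.
Consequently, these arrays have size len(update_gs) + len(reset_gs), i.e. 9 here.
One array is of ints and again defines group sizes. In this result array, a group size is the number of rows "elapsed" since the previous occurrence of update or reset.
The other array is of bools and flags whether a row relates to the reset function (value True) or to the update function (value False).
Regarding the order in which update and reset occurrences appear:
The groups of rows in data for update and for reset overlap. Occurrences are ordered by the last row of their respective group: the occurrence whose group ends on a later row of data comes later in the result arrays.
If an update group and a reset group share the same last row, update comes first in the result arrays.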

With the data above, the expected results are:

group_sizes = np.array([1, # 1st action is 'update'. It applies to 1st row in data.
                        0, # There is a 2nd 'update' on 0 row.
                        1, # At 2nd row of data, there is the 1st 'reset' action.
                        0, # There is a 2nd 'reset' on 0 row.
                        0, # There is a 3rd 'reset' on 0 row.
                        1, # There is the 3rd 'update' taking place, with 1 row elapsed since previous function.
                        1, # There is a 4th 'reset', with 1 row elapsed since previous function.
                        1, # There is the 4th 'update' taking place, with 1 row elapsed since previous function.
                        0, # Finally, the last 'reset' occurs, with the same ending row in 'data' as the previous 'update'.
                       ], dtype='int64')
# Sum of the values gives 5, the total number of rows in 'data'.                        

function_ids = np.array([False, # 1st action is 'update'.
                         False, # There is a 2nd 'update'.
                         True,  # There is the 1st 'reset' action.
                         True,  # There is a 2nd 'reset'.
                         True,  # There is a 3rd 'reset'.
                         False, # There is the 3rd 'update'.
                         True,  # There is a 4th 'reset'.
                         False, # There is the 4th 'update'.
                         True,  # Occurs finally the last 'reset'.
                        ], dtype='bool')
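
To make the ordering rule concrete, here is a minimal sketch that turns the two group-size arrays into the index of the last row of each group (counting rows from 1). The names update_last and reset_last are only illustrative and are not part of the question.

```python
import numpy as np

update_gs = np.array([1, 0, 2, 2], dtype='int64')
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')

# Last row (1-based) covered by each group: the cumulative sum of group sizes.
update_last = np.cumsum(update_gs)  # array([1, 1, 3, 5])
reset_last = np.cumsum(reset_gs)    # array([2, 2, 2, 4, 5])
```

Ordering these 9 end rows, with update first on ties, gives 1 (update), 1 (update), 2 (reset), 2 (reset), 2 (reset), 3 (update), 4 (reset), 5 (update), 5 (reset). The expected group_sizes are the differences between consecutive end rows, starting from 0.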

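Possibly an XY problem?
The question above is raised with the following topic in mind:

I have the two arrays mentioned above, reset_gs and update_gs, as inputs. How the related function (update or reset) works depends on which function was applied to the previous group (reset or update?) and on its result.
For this reason, I first tried interleaving the respective calls within 2 for loops. This resulted in complex code that I have not managed to get working. I believe that, given some time, it could be made to work with several ifs, buffer variables and boolean flags passing the previous state between the 2 interleaved for loops. This really does not seem adequate.
For this reason, I believe that opting for a flat for loop is better. The 2 result arrays I am looking for (the question above) would enable me to go for such a solution.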
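
As an illustration of the flat loop described here, the following sketch walks once over the two result arrays and dispatches to one function or the other. The bodies of update and reset below are hypothetical placeholders that only record what they receive; the real functions are not shown in the question.

```python
import numpy as np

data = np.array([1., 4., 3., 8., 9.], dtype='float64')
group_sizes = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0], dtype='int64')
function_ids = np.array([False, False, True, True, True,
                         False, True, False, True], dtype='bool')

calls = []  # log of (function name, rows received), for illustration only

def update(rows):
    # Hypothetical placeholder for the real 'update' function.
    calls.append(("update", rows.tolist()))

def reset(rows):
    # Hypothetical placeholder for the real 'reset' function.
    calls.append(("reset", rows.tolist()))

start = 0
for size, is_reset in zip(group_sizes.tolist(), function_ids.tolist()):
    stop = start + size
    rows = data[start:stop]  # rows elapsed since the previous action
    if is_reset:
        reset(rows)
    else:
        update(rows)
    start = stop
```

The loop runs len(group_sizes) times, here 9, and each call only sees the rows elapsed since the previous action, so any state to carry over can live in ordinary variables of the single loop.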

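What about looping over data?
data has several million rows. update_gs has a few thousand rows. reset_gs has from a few hundred to a few thousand rows.
In terms of performance, I have reason to believe that looping over update_gs and reset_gs (i.e. over the group definitions, thousands of iterations) rather than over data (each row individually, millions of iterations) will make the code faster.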

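This actually turned out to be a "how to do a merge sort?" question. I found out about the sortednp package, which seems to be the fastest way to do it.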

import numpy as np
from sortednp import merge # merge of sorted numpy arrays

# Input data.
update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')

# Switching from group sizes to indices of group last row, in-place.
np.cumsum(update, out=update)
np.cumsum(reset, out=reset)
# Performing a merge of sorted arrays, keeping insertion index for 'update'.
merged_idx, (update_idx, _) = merge(update, reset, indices=True)
# Going back to group sizes.
merged_gs = np.diff(merged_idx, prepend=0)
# Final step to get 'function_ids'.
function_ids = np.ones(len(merged_gs), dtype="bool")
function_ids[update_idx] = False

# Here we are.
merged_gs
Out[9]: array([1, 0, 1, 0, 0, 1, 1, 1, 0])
function_ids
Out[13]: array([False, False,  True,  True,  True, False,  True, False,  True])
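
If sortednp is not installed, roughly the same result can be sketched with NumPy alone. This is a swapped-in technique, not the approach above: it concatenates the two arrays of end rows and uses a stable sort instead of a linear merge, so it is assumed to be slower on large inputs. Because the update entries are placed first in the concatenation, the stable sort keeps update ahead of reset on ties. The names ends, is_reset and order are illustrative.

```python
import numpy as np

update_gs = np.array([1, 0, 2, 2], dtype='int64')
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')

# End rows of every group, 'update' groups first so they win ties.
ends = np.concatenate((np.cumsum(update_gs), np.cumsum(reset_gs)))
is_reset = np.concatenate((np.zeros(len(update_gs), dtype='bool'),
                           np.ones(len(reset_gs), dtype='bool')))

# A stable sort preserves the original relative order of equal end rows.
order = np.argsort(ends, kind='stable')

# Back to group sizes: rows elapsed since the previous action.
merged_gs = np.diff(ends[order], prepend=0)
function_ids = is_reset[order]
```

On the sample input this yields the same merged_gs and function_ids as the sortednp version.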
