我有两个3D（GPU）数组A和B，它们具有相同的第二维和第三维。例如，size(A,1) = 300 000、size(B,1) = 2000、size(A,2) = size(B,2) = 20和size(A,3) = size(B,3) = 100，以给予数量级的概念。实际上，size(A,3) = size(B,3)非常大，例如~ 1 000 000，但是数组是以沿着第三维切割的小块存储在外部的。（见下面的cfg. MWE），因此其中的代码需要进一步优化此外，A和B的值可以被假设为在65535以下的方式被限定，但是仍然存在数百个不同的值。
对于每个i、j和d，行A(i,:,d)和B(j,:,d)表示相同大小的多重集，我需要找到这两个多重集的最大公共子多重集（多重子集？）的大小，即它们的交集 * 作为多重集*的大小。此外，可以假定B的行是已排序的。
例如，如果[2 3 2 1 4 5 5 5 6 7]和[1 2 2 3 5 5 7 8 9 11]分别是两个这样的多重集，则它们的 * 多重集交集 * 是[1 2 2 3 5 5 7]，大小为7（7个元素作为一个多重集）。
我目前正在使用以下例程来执行此操作：

s = 300000; % 1st dim. of A
n = 2000; % 1st dim. of B
c = 10; % 2nd dim. of A and B
depth = 10; % 3rd dim. of A and B (corresponds to a batch of size 10 of A and B along the 3rd dim.)
N = 100; % upper bound on the possible values of A and B

A = randi(N,s,c,depth,'uint16','gpuArray');
B = randi(N,n,c,depth,'uint16','gpuArray');

Sizes_of_multiset_intersections = zeros(s,n,depth,'uint8'); % too big to fit in GPU memory together with A and B
for d=1:depth
    A_slice = A(:,:,d);
    B_slice = B(:,:,d);
    unique_B_values = permute(unique(B_slice),[3 2 1]); % B is smaller than A

    % compute counts of the unique B-values for each multiset:
    A_values_counts = permute(sum(uint8(A_slice==unique_B_values),2,'native'),[1 3 2]);
    B_values_counts = permute(sum(uint8(B_slice==unique_B_values),2,'native'),[1 3 2]);

    % compute the count of each unique B-value in the intersection:
    Sizes_of_multiset_intersections_tmp = gpuArray.zeros(s,n,'uint8');
    for i=1:n
        Sizes_of_multiset_intersections_tmp(:,i) = sum(min(A_values_counts,B_values_counts(i,:)),2,'native');
    end

    Sizes_of_multiset_intersections(:,:,d) = gather(Sizes_of_multiset_intersections_tmp);
end

也可以容易地修改上述代码以沿着维度3而不是d=1:depth（=大小为1的批）批处理计算结果，尽管代价是甚至更大的unique_B_values向量。
由于depth维度很大（即使是在批量沿着时），我对外部循环中的代码更快的替代品很感兴趣。所以我的问题是：是否有更快（例如更好的矢量化）的方法来计算相同大小的多重集的交集大小？

tic nBatches_B = 2000; sBatches_B = n/nBatches_B; Sizes_of_multiset_intersections_new = zeros(s,n,depth,'uint8'); for d=1:depth A_slice = A(:,:,d); B_slice = B(:,:,d); % compute counts of the unique B-values for each multiset: unique_B_values = reshape(unique(B_slice),1,1,[]); A_values_counts = sum(uint8(A_slice==unique_B_values),2,'native'); % s x 1 x length(uniqueB) array B_values_counts = reshape(sum(uint8(B_slice==unique_B_values),2,'native'),1,n,[]); % 1 x n x length(uniqueB) array % Not possible to do it all in one go, must split in batches along B for ii = 1:nBatches_B Sizes_of_multiset_intersections_new(:,((ii-1)*sBatches_B+1):ii*sBatches_B,d) = sum(min(A_values_counts,B_values_counts(:,((ii-1)*sBatches_B+1):ii*sBatches_B,:)),3,'native'); % Vectorized end end toc

1条答案

按热度按时间

4ktjp1zp1#

免责声明：这不是一个基于GPU的解决方案（没有一个好的GPU）。我觉得结果很有趣，想分享一下，但如果你认为应该这样做，我可以删除这个答案。

下面是代码的矢量化版本，它可以摆脱内部循环，代价是必须处理一个更大的数组，这个数组可能太大而无法放入内存。
我们的想法是让矩阵A_values_counts和B_values_counts是3D矩阵，其形状使得调用min(A_values_counts,B_values_counts)将由于 * 隐式扩展 * 而一次计算所有内容。在后台，它将创建一个大小为s x n x length(unique_B_values)的大数组（可能大多数时候太大了）
为了绕过对大小的约束，沿着n维度（即，B的第一维）分批计算结果：

下面是一个小的基准测试，它使用了不同的批处理数量值。您可以看到，最小值大约为400（批处理大小为50），处理时间减少了大约10%（每个点是3次运行的平均值）。（编辑：x轴为批次数量，而非批次大小）

我也很想知道它在GPU阵列上的表现！

赞(0）回复(0）举报 2022-11-24

matlab 从两个3D多重集数组中查找任意两个对应多重集交集大小的快速方法

1条答案

相关问题

热门标签

最新问答