C++: Counting occurrences of numbers in a CUDA array

Posted by qvsjd97n on 2023-05-20 in Other


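I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrences of every number in the array. There are only a few distinct numbers (about 10), but these numbers can range from 1 to 1000000. About 9/10 of the numbers are 0, and I don't need their count. The result looks something like this: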

58458 -> 1000 occurrences
15 -> 412 occurrences

I have an implementation using atomicAdd, but it is too slow (many threads write to the same address). Does anyone know a fast/efficient method?


Source: https://stackoverflow.com/questions/7573900/counting-occurrences-of-numbers-in-a-cuda-array

4 Answers


Answer 1 (fslejnso)

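You can implement a histogram by first sorting the numbers and then doing a keyed reduction.
The most straightforward method is to use thrust::sort followed by thrust::reduce_by_key. It is also often much faster than ad hoc binning based on atomics. The original answer linked to an example.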
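
As a rough host-side illustration of the sort-then-reduce idea (not the Thrust code itself), the sketch below uses std::sort and a manual run-length pass in plain C++. On the GPU, thrust::sort and thrust::reduce_by_key would play the same two roles. The helper name count_occurrences and the decision to skip zeros inside it are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns (value, count) pairs for every non-zero value.
// The input is taken by value because sorting reorders it.
std::vector<std::pair<unsigned int, int>> count_occurrences(std::vector<unsigned int> data) {
    // Step 1: sort so that equal values become contiguous
    // (thrust::sort would do this on the device).
    std::sort(data.begin(), data.end());

    // Step 2: collapse each run of equal values into one (value, count) pair
    // (thrust::reduce_by_key over a sequence of ones would do this on the device).
    std::vector<std::pair<unsigned int, int>> result;
    for (std::size_t i = 0; i < data.size();) {
        std::size_t j = i;
        while (j < data.size() && data[j] == data[i]) ++j;
        if (data[i] != 0) {  // the question does not need the count of zeros
            result.emplace_back(data[i], static_cast<int>(j - i));
        }
        i = j;
    }
    return result;
}
```

Because the output has one entry per distinct value, it stays tiny (about 10 entries in the question's scenario) no matter how large the value range is.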


Answer 2 (hi3rlvi2)

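I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU Computing SDK. You can find it at http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.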


Answer 3 (hlswsv35)

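I'm comparing two approaches suggested in the duplicate question "thrust count occurence", namely:
1. Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
2. Using thrust::unique_copy and thrust::upper_bound.
Below is a fully worked example.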

#include <time.h>       // --- time
#include <stdlib.h>     // --- srand, rand
#include <stdio.h>      // --- printf
#include <iostream>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

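// --- Define VERBOSE to print the sorted data and the results
//#define VERBOSE
// --- With NO_HISTOGRAM defined, the second approach (thrust::unique_copy) is timed;
//     comment it out to time the first approach (thrust::counting_iterator)
#define NO_HISTOGRAM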

/********/
/* MAIN */
/********/
int main() {

    const int N = 1048576;
    //const int N = 20;
    //const int N = 128;

    TimingGPU timerGPU;

    // --- Initialize random seed
    srand(time(NULL));

    thrust::host_vector<int> h_code(N);

    for (int k = 0; k < N; k++) {
        // --- Generate random numbers between 0 and 9
        h_code[k] = (rand() % 10);
    }

    thrust::device_vector<int> d_code(h_code);
    //thrust::device_vector<unsigned int> d_counting(N);

    thrust::sort(d_code.begin(), d_code.end());

    h_code = d_code;

    timerGPU.StartCounter();

#ifdef NO_HISTOGRAM
    // --- The number of d_cumsum bins is equal to the maximum value plus one
    int num_bins = d_code.back() + 1;

    thrust::device_vector<int> d_code_unique(num_bins);
    thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin());
    thrust::device_vector<int> d_counting(num_bins);
    thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin());  
#else
    thrust::device_vector<int> d_cumsum;

    // --- The number of d_cumsum bins is equal to the maximum value plus one
    int num_bins = d_code.back() + 1;

    // --- Resize d_cumsum storage
    d_cumsum.resize(num_bins);

    // --- Find the end of each bin of values - Cumulative d_cumsum
    thrust::counting_iterator<int> search_begin(0);
    thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin());

    // --- Compute the histogram by taking differences of the cumulative d_cumsum
    //thrust::device_vector<int> d_counting(num_bins);
    //thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin());
#endif

    printf("Timing GPU = %f\n", timerGPU.GetCounter());

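#ifdef VERBOSE
    printf("After\n");
    for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]);
#ifndef NO_HISTOGRAM
    thrust::host_vector<int> h_cumsum(d_cumsum);
    printf("\nCounting\n");
    // --- d_counting is not computed above, so take the differences of the cumulative sums here
    for (int k = 0; k < num_bins; k++) {
        int counting = h_cumsum[k] - (k > 0 ? h_cumsum[k - 1] : 0);
        printf("element = %i; counting = %i; cumsum = %i\n", k, counting, h_cumsum[k]);
    }
#else
    thrust::host_vector<int> h_code_unique(d_code_unique);
    thrust::host_vector<int> h_counting(d_counting);

    printf("\nCounting\n");
    // --- Both vectors hold num_bins entries (not N); d_counting contains upper bounds, i.e. cumulative counts
    for (int k = 0; k < num_bins; k++) printf("element = %i; cumulative counting = %i\n", h_code_unique[k], h_counting[k]);
#endif
#endif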
}

The first approach turned out to be the fastest. On an NVIDIA GTX 960 card, I measured the following timings for N = 1048576 array elements:

First approach: 2.35ms
First approach without thrust::adjacent_difference: 1.52ms
Second approach: 4.67ms

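Note that there is no strict need to compute the adjacent difference explicitly, since this operation can be done manually inside a later kernel, if needed.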
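
To make the last remark concrete, here is a minimal host-side sketch in plain C++ (std::upper_bound standing in for thrust::upper_bound). It builds the cumulative histogram of the first approach and then recovers a single bin count on demand, which is the subtraction a later kernel could do per thread instead of running a separate thrust::adjacent_difference pass. The helper names cumulative_histogram and count_in_bin are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for sorted input, cumsum[b] is the number of elements <= b.
std::vector<int> cumulative_histogram(const std::vector<int>& sorted, int num_bins) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));  // upper_bound needs sorted data
    std::vector<int> cumsum(static_cast<std::size_t>(num_bins));
    for (int b = 0; b < num_bins; ++b) {
        // Position of the first element greater than b = count of elements <= b.
        cumsum[b] = static_cast<int>(
            std::upper_bound(sorted.begin(), sorted.end(), b) - sorted.begin());
    }
    return cumsum;
}

// Count for bin b, recovered on demand from two neighbouring cumulative values.
int count_in_bin(const std::vector<int>& cumsum, int b) {
    return cumsum[b] - (b > 0 ? cumsum[b - 1] : 0);
}
```

Note that this layout needs one bin per possible value, so for the question's value range (up to 1000000) the sort plus reduce_by_key variant, which stores only the distinct values, may be the more economical choice.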


Answer 4 (ve7v8dk2)

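As others have said, you can use the sort & reduce_by_key approach to count frequencies. In my case, I needed to get the mode of an array (the value with the maximum frequency/number of occurrences), so here is my solution:
1 - First, we create two new arrays, one containing a copy of the input data and another filled with ones to be reduced (summed) later: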

// Input: [1 3 3 3 2 2 3]
// *(Temp) dev_keys: [1 3 3 3 2 2 3]
// *(Temp) dev_ones: [1 1 1 1 1 1 1]

// Copy input data
thrust::device_vector<int> dev_keys(myptr, myptr+size);

// Create an array of the same length and fill it with ones
thrust::device_vector<int> dev_ones(size);
thrust::fill(dev_ones.begin(), dev_ones.end(), 1);

2 - Then, we sort the keys, because reduce_by_key only merges equal keys that are contiguous.

// Sort keys (see below why)
thrust::sort(dev_keys.begin(), dev_keys.end());

3 - Next, we create two output vectors, one for the (unique) keys and one for their frequencies:

// At most `size` unique keys can come out of the reduction
thrust::device_vector<int> output_keys(size);
thrust::device_vector<int> output_freqs(size);

4 - Finally, we perform the reduction by key:

// Reduce contiguous keys: [1 3 3 3 2 2 3] => [1 3 2 1] Vs. [1 3 3 3 3 2 2] => [1 4 2] 
thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator> new_end;
new_end = thrust::reduce_by_key(dev_keys.begin(), dev_keys.end(), dev_ones.begin(), output_keys.begin(), output_freqs.begin());

5 - ...and, if we want, we can get the most frequent element:

// Get most frequent element
// Get index of the maximum frequency
int num_keys = new_end.first  - output_keys.begin();
thrust::device_vector<int>::iterator iter = thrust::max_element(output_freqs.begin(), output_freqs.begin() + num_keys);
unsigned int index = iter - output_freqs.begin();

int most_frequent_key = output_keys[index];
int most_frequent_val = output_freqs[index];  // Frequencies
