C++: How to perform a 1D reduction over 2D data in Thrust

wmvff8tz  posted on 2023-01-14  in Other

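I am new to CUDA and the Thrust library. I am learning and trying to implement a function that calls a Thrust function inside a for loop. Is there a way to convert this loop into another Thrust function? Or should I use a CUDA kernel to achieve this?
I came up with the following code: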

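#include <cstdio>

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// thrust functor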
struct GreaterthanX
{
    const float _x;
    GreaterthanX(float x) : _x(x) {}

    __host__ __device__ bool operator()(const float &a) const
    {
        return a > _x;
    }
};

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int row = 3;
    int col = 4;
    thrust::device_vector<int> vec(row * col);
    thrust::device_vector<int> count(row);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the elements greater than 2 in each row of vec;
    // row i occupies the index range [i * col, (i + 1) * col)
    for (int i = 0; i < row; i++)
    {
        count[i] = thrust::count_if(vec.begin() + i * col, vec.begin() + (i + 1) * col, GreaterthanX(2));
    }

    thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
    int max_val = *result;
    unsigned int position = result - count.begin();

    printf("result = %d at position %u\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

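My goal is to find the row that has the most elements greater than 2. I am struggling to do this without a loop. Any suggestions would be much appreciated. Thanks.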
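For reference, here is a minimal host-only sketch of what the question asks for, written in plain C++ without Thrust. The helper names count_greater_per_row and best_row are hypothetical and exist only for illustration. It assumes the matrix is stored row-major, so row i occupies the half-open index range [i * cols, (i + 1) * cols).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Thrust): for every row of a row-major
// rows x cols matrix, count how many elements exceed the threshold.
std::vector<int> count_greater_per_row(const std::vector<int>& data,
                                       std::size_t rows, std::size_t cols,
                                       int threshold)
{
    assert(data.size() == rows * cols);
    std::vector<int> counts(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        // Row i occupies the half-open index range [i * cols, (i + 1) * cols).
        auto first = data.begin() + static_cast<std::ptrdiff_t>(i * cols);
        auto last = first + static_cast<std::ptrdiff_t>(cols);
        counts[i] = static_cast<int>(std::count_if(
            first, last, [threshold](int v) { return v > threshold; }));
    }
    return counts;
}

// Hypothetical helper: returns {max count, index of the first row with it}.
std::pair<int, std::size_t> best_row(const std::vector<int>& counts)
{
    assert(!counts.empty());
    auto it = std::max_element(counts.begin(), counts.end());
    return {*it, static_cast<std::size_t>(std::distance(counts.begin(), it))};
}
```

For the 3 x 4 example matrix and a threshold of 2, the per-row counts are 3, 1 and 4, so the best row is index 2 with a count of 4.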

Answer #1 (bf1o4zei)

Solution using Thrust

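Here is an implementation that combines thrust::reduce_by_key with several "fancy iterators".
I also took the liberty of sprinkling in some const, auto and lambdas for elegance and readability. Because of the lambdas, you need to pass the -extended-lambda flag to nvcc.
thrust::distance is the canonical way to subtract Thrust iterators.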

#include <cassert>
#include <cstdio>

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/extrema.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int const row = 3;
    int const col = 4;
    thrust::device_vector<int> vec(row * col);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;
    thrust::device_vector<int> count(row);

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec

    // counting iterator avoids read from global memory, gives index into vec
    auto keys_in_begin = thrust::make_counting_iterator(0);
    auto keys_in_end = thrust::make_counting_iterator(row * col);
    
    // transform vec on the fly
    auto vals_in_begin = thrust::make_transform_iterator(
        vec.cbegin(), 
        [] __host__ __device__ (int val) { return val > 2 ? 1 : 0; });
    
    // discard to avoid write to global memory
    auto keys_out_begin = thrust::make_discard_iterator();
    
    auto vals_out_begin = count.begin();
    
    // transform keys (indices) into row indices and then compare
    // the divisions are one reason one might rather
    // use MatX for higher dimensional data
    auto binary_predicate = [col] __host__ __device__ (int i, int j){
        return i / col == j / col;
    };
    
    // this function returns a new end for count 
    // b/c the final number of elements is often not known beforehand
    auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
                                         vals_in_begin,
                                         keys_out_begin,
                                         vals_out_begin,
                                         binary_predicate);
    // make sure that we didn't provide too small of an output vector
    assert(thrust::get<1>(new_ends) == count.end());

    auto const result = thrust::max_element(count.begin(), count.end());
    int const max_val = *result;
    auto const position = thrust::distance(count.begin(), result);

    // thrust::distance returns a signed difference type, so cast it to match %d
    std::printf("result = %d at position %d\r\n", max_val, static_cast<int>(position));
    // result = 4 at position 2

    return 0;
}
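
To make the grouping behaviour easier to follow, here is a host-only sketch in plain C++ that emulates what the keyed reduction does with the index keys and the row-comparing predicate. It is not Thrust's implementation: segmented_sum and count_greater_than_two are hypothetical names, and the sketch assumes only that adjacent keys are compared and that each run of "equal" keys produces one output value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of a keyed reduction: consecutive
// indices whose keys compare equal under same_segment are summed into
// one output element.
template <typename ValueFn, typename KeyEq>
std::vector<int> segmented_sum(std::size_t n, ValueFn value_at, KeyEq same_segment)
{
    std::vector<int> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || !same_segment(i - 1, i)) {
            out.push_back(0);  // a new run of equal keys starts here
        }
        out.back() += value_at(i);
    }
    return out;
}

// Keys are the flat indices 0..n-1; two indices belong to the same row
// when integer division by the column count gives the same result.
std::vector<int> count_greater_than_two(const std::vector<int>& vec, std::size_t col)
{
    assert(col > 0 && vec.size() % col == 0);
    return segmented_sum(
        vec.size(),
        [&vec](std::size_t i) { return vec[i] > 2 ? 1 : 0; },
        [col](std::size_t i, std::size_t j) { return i / col == j / col; });
}
```

The value lambda plays the role of the transform iterator (values are mapped to 0 or 1 on the fly), and the key-equality lambda plays the role of binary_predicate.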
Bonus solution using MatX

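As mentioned in the comments, NVIDIA has released a new high-level C++17 library called MatX, which targets problems involving (dense) multi-dimensional data. The library tries to unify multiple low-level libraries such as cuFFT, cuSOLVER and CUTLASS behind a Python/MATLAB-like interface. At the time of writing (v0.2.2), the library is still in initial development and therefore probably does not guarantee a stable API. For that reason, and because its performance is not as optimized as that of the more mature Thrust library and its documentation and samples are not yet exhaustive, MatX should not be used in production code. While building this solution I found a bug, which was fixed very quickly. So this code only works on the main branch, not with the current release v0.2.2, and some of the features used may not appear in the documentation yet.
A solution using MatX looks like this: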

#include <iostream>
#include <matx.h>

int main(void)
{
    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    // tensor.Print(0,0); // print full tensor

    auto count = matx::make_tensor<int, 1>({row});
    // count.Print(0); // print full count

    // Goal: For each row, count the number of elements greater than 2.
    // And then find the row with the max count

    // the kind of reduction is determined through the shapes of tensor and count
    matx::sum(count, matx::as_int(tensor > 2));

    // A single value (scalar) is a tensor of rank 0: 
    auto result_idx = matx::make_tensor<matx::index_t>();
    auto result = matx::make_tensor<int>();
    matx::argmax(result, result_idx, count);

    cudaDeviceSynchronize();
    std::cout << "result = " << result() 
              << " at position " << result_idx() << "\r\n";
    // result = 4 at position 2

    return 0;
}

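Because MatX uses lazily executed operators, matx::as_int(tensor > 2) is effectively fused into the kernel, which achieves the same effect as using a thrust::transform_iterator in Thrust.
Because MatX knows about the regularity of the problem and Thrust does not, the MatX solution could potentially perform better than the Thrust solution, and it is certainly more elegant. It is also possible to construct tensors in already allocated memory, so the libraries can be mixed. For example, I constructed a tensor in the memory of a thrust::vector named vec by passing thrust::raw_pointer_cast(vec.data()) to the tensor's constructor.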
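
The fusion idea mentioned above is not tied to either library. As a rough host-only illustration in plain C++ (fused_row_count is a hypothetical name, and this says nothing about how MatX or Thrust implement it internally), the predicate can be applied inside the reduction itself, so no intermediate array of 0/1 values is ever stored:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: counts the elements greater than 2 in one row of a
// row-major matrix in a single pass. The 0/1 predicate result is added to
// the accumulator directly instead of being written to a temporary buffer.
int fused_row_count(const std::vector<int>& data, std::size_t row, std::size_t cols)
{
    assert((row + 1) * cols <= data.size());
    auto first = data.begin() + static_cast<std::ptrdiff_t>(row * cols);
    auto last = first + static_cast<std::ptrdiff_t>(cols);
    return std::accumulate(first, last, 0,
                           [](int acc, int v) { return acc + (v > 2 ? 1 : 0); });
}
```

On the GPU the motivation is the same: avoiding a separate pass that writes the mapped values to global memory and reads them back.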
