Does CUDA provide a way to achieve concurrency across multiple devices while using pageable host memory? (Note: this is not the never-ending question about cudaMemcpyAsync blocking on pageable memory with a single device.)
Here is the context: on a node with several CUDA devices, I batch my workload so that it is spread evenly across all devices. One batch consists of num_devices instances (one per device) of a host-to-device cudaMemcpyAsync, followed by a kernel launch, and finally a device-to-host cudaMemcpyAsync. These num_devices instances are issued from a for loop over the devices. The problem is the following: since the host memory I use is pageable, cudaMemcpyAsync is synchronous with respect to the host, so each iteration of the innermost for loop only starts once the previous one has fully completed, which prevents the devices from working concurrently.

A minimal example is attached below:
/*
  Compilation: nvcc main.cu -o main.cuda
  (nvcc version 12)
*/
#include <cuda_runtime.h>
#include <vector>
#include <stdint.h>
#include <cassert>

// trivial kernel for illustration
__global__
void kernel(double* d_u, const uint64_t size)
{
    uint64_t j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < size) {
        d_u[j] *= 2.0;
    }
}

// providing a home-made popcnt in case __builtin_popcount is not supported
unsigned int hm_popcnt(int word) {
    unsigned int n = 0;
    while (word) {
        if (word & 1) { ++n; }
        word >>= 1;
    }
    return n;
}

int main() {
    unsigned int n = 30;
    uint64_t dimension = (1ULL) << n;
    unsigned int n0 = 27;
    uint64_t batch_size = (1ULL) << n0;
    int blockSize = 256;
    int numBlocks = (batch_size + blockSize - 1) / blockSize;

    int num_devices;
    cudaGetDeviceCount(&num_devices);
    assert(num_devices != 1);                     // 1 device, no luck
    assert(__builtin_popcount(num_devices) == 1); // for sake of simplicity
    //assert(hm_popcnt(num_devices) == 1);

    cudaStream_t streams[num_devices];
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaStreamCreateWithFlags(&streams[dev_id], cudaStreamNonBlocking);
    }

    std::vector<double> h_v(dimension, 1.0); // pageable memory

    // each device holds its own array of dimension <batch_size>
    double* d_v[num_devices];
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaMalloc((void**)&d_v[dev_id], batch_size*sizeof(double));
    }

    uint32_t num_batches = ((1UL) << (n - n0)) / num_devices;
    for (uint32_t i = 0; i < num_batches; ++i) {
        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
            cudaSetDevice(dev_id);
            uint64_t start_index = (i*num_devices + dev_id) * batch_size;
            cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
            kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
            cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
            // h_v lives in pageable memory, so the cudaMemcpyAsync calls are
            // synchronous, preventing the devices from working at the same time
        }
    }

    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaFree(d_v[dev_id]);
        cudaStreamDestroy(streams[dev_id]);
    }

    return 0;
}
Since dimension is very large in production, performing a host-to-host copy from pageable to pinned memory is not an option. Instead, I can work around the problem by swapping the outer and inner loops and using an OpenMP for loop over the devices (at the cost of involving num_devices CPU threads, which is fine):
// requires adding #include "omp.h" and compiling with flags -Xcompiler -fopenmp
#pragma omp parallel for schedule(static, 1) num_threads(num_devices)
for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
    cudaSetDevice(dev_id);
    for (uint32_t i = 0; i < num_batches; ++i) {
        uint64_t start_index = (i*num_devices + dev_id) * batch_size;
        cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
        kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
        cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
    }
}
IMO this is not very elegant, and somehow I feel CUDA should offer something cleaner to achieve multi-device concurrency in this simple scenario. Is that the case? If not, is there another solution?
1 Answer
Since you are using pageable memory rather than pinned memory, cudaMemcpyAsync blocks the calling thread, just like cudaMemcpy does. So, in effect, you have to call cudaMemcpy or cudaMemcpyAsync from different host threads to get pseudo-parallelism between the devices. Note that this is not an asynchronous copy at all: both the CPU thread and the GPU are blocked for the duration of the copy.
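For illustration, here is a minimal sketch of that multi-threaded approach using std::thread instead of OpenMP; it assumes the same h_v, d_v, streams, kernel, and size variables as in the question, and omits error checking:

// Sketch only: std::thread variant of the OpenMP workaround above.
// Assumes h_v, d_v, streams, num_devices, num_batches, batch_size,
// numBlocks and blockSize are set up as in the question.
#include <thread>

std::vector<std::thread> workers;
for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
    workers.emplace_back([&, dev_id]() {
        cudaSetDevice(dev_id); // the current device is a per-thread setting
        for (uint32_t i = 0; i < num_batches; ++i) {
            uint64_t start_index = (i*num_devices + dev_id) * batch_size;
            // Each thread blocks on its own (pageable) copies, so the
            // devices still make progress concurrently.
            cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
            kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
            cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
        }
        cudaStreamSynchronize(streams[dev_id]); // drain the stream before the thread exits
    });
}
for (auto& t : workers) { t.join(); }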
If you want a truly asynchronous copy, you have to use pinned memory; there is no way around that. You can simply replace the allocation of the pageable memory

std::vector<double> h_v(dimension, 1.0); // pageable memory

in your code with a call to cudaMallocHost()! If you only want concurrency between the devices, not between host and device, then your workaround already achieves that; in that case you should replace cudaMemcpyAsync with cudaMemcpy.
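A minimal sketch of that pinned-memory variant, assuming the node can actually pin dimension*sizeof(double) bytes and reusing the names from the question (error checking omitted):

// Sketch only: pinned host allocation instead of std::vector.
double* h_v = nullptr;
cudaMallocHost((void**)&h_v, dimension*sizeof(double)); // pinned (page-locked) memory
for (uint64_t j = 0; j < dimension; ++j) { h_v[j] = 1.0; }

for (uint32_t i = 0; i < num_batches; ++i) {
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        uint64_t start_index = (i*num_devices + dev_id) * batch_size;
        // With pinned memory these calls merely enqueue work and return,
        // so a single host thread keeps all devices busy.
        cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
        kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
        cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
    }
}
for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
    cudaSetDevice(dev_id);
    cudaStreamSynchronize(streams[dev_id]); // wait for all enqueued work
}
cudaFreeHost(h_v); // pinned memory is released with cudaFreeHost

Within each stream, the batch i+1 host-to-device copy is ordered after the batch i device-to-host copy, so reusing d_v[dev_id] across batches remains safe.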