Paddle Error: cudaMemcpy failed in paddle::platform::GpuMemcpySync

eblbsuwk · posted 2021-11-30 in Java
  • Version / environment info:

   1) PaddlePaddle version: 1.6.3
   2) GPU: V100 32 GB, CUDA 10.0, cuDNN 7.6
   3) System: Ubuntu 16.04, Python 3.6.9

  • Training info:

   1) Single machine, multi-process, multi-GPU

The model is a text-generation model. When running evaluation during training, the processes on some of the cards abort with the following error:

Exception in thread Thread-2:
Traceback (most recent call last):
File "/root/liwei85/installed-packages/Python3.6.9/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/root/liwei85/installed-packages/Python3.6.9/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/paddle/fluid/layers/io.py", line 474, in provider_thread
six.reraise(*sys.exc_info())
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/root/liwei85/envs/paddle1.6_py3.6/lib/python3.6/site-packages/paddle/fluid/layers/io.py", line 455, in provider_thread
for tensors in func():
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 256, in wrapper
examples, batch_size, phase=phase, do_dec=do_dec, place=place):
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 217, in _prepare_batch_data
yield self._pad_batch_records(batch_records, do_dec, place)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 300, in _pad_batch_records
return self._prepare_infer_input(batch_records, place=place)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 350, in _prepare_infer_input
place, [range(trg_word.shape[0] + 1)] * 2)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/graphsum_reader.py", line 342, in to_lodtensor
data_tensor.set(data, place)
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)

Error Message Summary:

Error: cudaMemcpy failed in paddle::platform::GpuMemcpySync (0x7f2389053f00 -> 0x7f23cf286540, length: 60) error code : 2, Please see detail in https://docs.nvidia.com/cuda/
cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (/paddle/paddle/fluid/platform/gpu_info.cc:288)

Traceback (most recent call last):
File "./src/run.py", line 38, in <module>
run_graphsum(args)
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/run_graphsum.py", line 418, in main
decode_path=args.decode_path + "/test_final_preds")
File "/set/liwei85/projects/baidu/personal-code/GraphSum-Paddle/src/networks/graphsum/run_graphsum.py", line 618, in evaluate
preds.append(dec_out[i][0])
KeyError: 0


k4ymrczo1#

Questions:

  1. Can this error be reproduced reliably? Is prediction done on a single card?
  2. Could you use the Print op to check the real shape of dec_out, and then confirm whether dec_out[i][0] is an out-of-bounds access?
  3. You could also run result = numpy.array(dec_out) and inspect result's shape and contents to judge whether result[i][0] is out of bounds.
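Point 3 above can be sketched as a small guard around the indexing that raised `KeyError: 0`. The helper name `safe_first_token` is hypothetical; `dec_out` stands in for the fetched output, which is assumed to convert cleanly to a numpy array:

```python
import numpy as np

def safe_first_token(dec_out, i):
    """Convert the fetched output to a numpy array and guard the
    dec_out[i][0] access from the traceback against out-of-bounds
    indices. Hypothetical helper for debugging, not Paddle API."""
    result = np.array(dec_out)
    print("dec_out shape:", result.shape)
    if i >= result.shape[0] or result.ndim < 2 or result.shape[1] == 0:
        return None  # out of bounds: this card may have received no data
    return result[i][0]

# Stand-in output: 3 decoded sequences of 5 token ids each
demo = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
print(safe_first_token(demo, 1))   # index in range
print(safe_first_token(demo, 7))   # index out of range -> None
```

Printing the shape first makes it immediately visible whether the card produced fewer sequences than the loop expects.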

wztqucjr2#

  1. Prediction is done on multiple cards. Could the error occur because the final step does not have enough data left to split across all the cards? That is, the data remaining at the last step cannot be distributed to every card, so only some of the cards receive data.
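The arithmetic behind this hypothesis is easy to check. A minimal sketch (the helper is illustrative, not Paddle code) counts how many cards actually receive a batch on the final step:

```python
def last_step_cards(n_examples, batch_size, n_cards):
    """How many cards receive data on the final multi-card step.
    Illustrative helper for the partial-last-batch hypothesis."""
    per_step = batch_size * n_cards        # examples consumed per full step
    remainder = n_examples % per_step      # examples left for the last step
    if remainder == 0:
        return n_cards                     # last step is full: every card fed
    return -(-remainder // batch_size)     # ceil-divide leftovers into batches

# 1000 test examples, batch size 16, 8 cards:
# 1000 % 128 = 104 leftover examples -> only 7 of the 8 cards get data
print(last_step_cards(1000, 16, 8))
```

Whenever the result is smaller than the number of cards, the unfed cards run the parallel graph with no input, which matches the crash appearing only on some processes during evaluation.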

hof1towb3#

That could well be the cause. We suggest choosing the execution mode based on the data that was read in.


6qqygrtg4#

What does "choosing the execution mode based on the data that was read in" mean? What conditions does multi-card prediction need to satisfy?


tnkciper5#

You can declare two graphs: one that calls with_data_parallel, and one that does not. Run the last batch through the non-parallel graph.
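The dispatch logic behind this suggestion can be sketched framework-agnostically: full groups of per-card batches go through the parallel path, and the trailing partial group goes through the single-card path. The callables `run_parallel` / `run_single` are hypothetical stand-ins for running the executor on the with_data_parallel CompiledProgram and on the plain program, respectively:

```python
def run_inference(batches, n_cards, run_parallel, run_single):
    """Send full groups of n_cards batches to the data-parallel graph
    and any leftover batches (fewer than n_cards) to the non-parallel
    graph, so no card ever runs the parallel graph without data."""
    results = []
    group = []
    for batch in batches:
        group.append(batch)
        if len(group) == n_cards:      # every card has a batch: parallel OK
            results.extend(run_parallel(group))
            group = []
    for batch in group:                # leftover < n_cards: go single-card
        results.extend(run_single(batch))
    return results

# Toy usage: "running" a batch just doubles its values
batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
out = run_inference(
    batches, n_cards=2,
    run_parallel=lambda g: [[2 * x for x in b] for b in g],
    run_single=lambda b: [[2 * x for x in b]],
)
print(out)
```

With 5 batches and 2 cards, the first four batches run in two parallel steps and the fifth falls through to the single-card path, which is exactly the last-step case that crashed above.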
