To help resolve your question quickly, before opening an issue please search for similar questions via: [search issue keywords] [filter by labels] [official documentation]
If you find no similar question, please provide the following details when opening the issue so it can be resolved quickly:
- Title: a concise, precise summary of your problem
- Version and environment information:
1) PaddlePaddle version: please provide your PaddlePaddle version number, e.g. 1.1, or the CommitID
2) CPU/GPU: if you train on GPU, please provide the GPU driver version and the CUDA and cuDNN version numbers
3) System environment: please describe the system type and version, e.g. Mac OS 10.14
4) Python version
5) GPU memory information
Note: you can obtain the above information by running summary_env.py.
- Reproduction information: if reporting an error, please provide the reproduction environment and steps
- Problem description: please describe your problem in detail, and paste the error message and key log/code snippets
Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search GitHub issues in case a similar issue was submitted or resolved before.
If there is no existing solution, please provide us with the following details:
System information
- PaddlePaddle version: 2.1.2
- CPU: including CPU MKL/OpenBLAS/MKLDNN version: N/A
- GPU: including CUDA/cuDNN version: cuDNN 8.1, CUDA 11.1.105, NVIDIA driver 450.51.06
- OS Platform and Distribution (e.g. Mac OS 10.14): Ubuntu 20.04
- Python version: Python 3.8.10
Note: You can get most of the information by running summary_env.py.
To Reproduce
Steps to reproduce the behavior
Install a DALI build from https://github.com/JanuszL/DALI/tree/non_owning_dl_tensors (it is upstream DALI plus changes in plugin/paddle.py).
Remove the fluid.core._cuda_synchronize(pd_gpu_place) call.
Run the ResNet50 DALI example https://github.com/NVIDIA/DALI/tree/main/docs/examples/use_cases/paddle/resnet50 with 8 GPUs.
Describe your current behavior
Once every few hundred iterations we get:
Error: /paddle/paddle/fluid/operators/softmax_with_cross_entropy_op.cu:536 Assertion labels_[idx_lbl] >= 0 && labels_[idx_lbl] < d_ || labels_[idx_lbl] == ignore_idx_ failed. The value of label[70] expected >= 0 and < 1000, or == -100, but got -5252642118288355952. Please check input value.
This indicates that the label data copied from DALI to the PaddlePaddle tensor is corrupted.
The mentioned DALI change (NVIDIA/DALI#3305) removes DALI's ownership of framework tensors. The tensors returned by the DALI framework iterator are no longer reused by DALI every iteration; instead a new tensor is allocated each time, and the user can do whatever they want with the previously returned one without fear that DALI will overwrite the data.
I guess it used to work because the tensor always held valid data from the previous iteration, so even if corruption happened during the copy it went undetected. Now a fresh tensor is used every time and the issue becomes visible. Adding a fluid.core._cuda_synchronize(pd_gpu_place) call after the PaddlePaddle tensors are allocated seems to solve the issue.
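For reference, a minimal sketch of how that workaround would sit in the feed step of the DALI Paddle iterator. This is an illustration under assumptions, not the exact example code: pipe and device_id are placeholders, the helper names come from DALI's plugin/paddle.py, and only the placement of the synchronize call between allocation and feed_ndarray is the point.

    import paddle.fluid as fluid
    from nvidia.dali.plugin.paddle import feed_ndarray, to_paddle_type

    # Assumed setup: `pipe` is an already-built DALI pipeline and `device_id`
    # is the GPU this process drives.
    pd_gpu_place = fluid.CUDAPlace(device_id)
    data, label = pipe.run()

    # Allocate the destination Paddle tensors first.
    outputs, ptrs = [], []
    for dali_out in (data, label):
        t = dali_out.as_tensor()
        lod = fluid.core.LoDTensor()
        lod._set_dims(t.shape())
        ptrs.append(lod._mutable_data(pd_gpu_place, to_paddle_type(t)))
        outputs.append(lod)

    # Workaround discussed above: synchronize once the allocations are issued,
    # so the device-to-device copies DALI performs below cannot race with
    # work still pending on Paddle's streams.
    fluid.core._cuda_synchronize(pd_gpu_place)

    # DALI copies its outputs into the freshly allocated Paddle buffers.
    for dali_out, ptr in zip((data, label), ptrs):
        feed_ndarray(dali_out.as_tensor(), ptr)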
Code to reproduce the issue
Other info / logs
@willthefrog - you did the original integration between DALI and PaddlePaddle. Maybe there is something obviously wrong with the approach proposed in NVIDIA/DALI#3305.
10 answers
qhhrdooz1#
Hi! We have received your issue and will arrange for technical staff to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment and version information, and the error message. You may also look for an answer in the official API documentation, the FAQ, historical GitHub issues, and the AI community. Have a nice day!
7dl7o3gd2#
add PIC for Paddle with DALI integration - @zhiqiu
ocebsuys3#
Hi @jeng1220 @JanuszL, I tried to compile your PR 3305 these days but failed. Could you give me a built wheel package for debugging?
After reading your PR, I suggest you try cudaEventRecord on the compute stream + cudaStreamWaitEvent on the DALI stream, instead of fluid.core._cuda_synchronize (which synchronizes all of Paddle's compute streams), between mutable_data and feed_ndarray. Pseudo code:
Just like Paddle's DataLoader does:
Paddle/paddle/fluid/operators/reader/buffered_reader.cc, lines 158 to 198 at 97a73e1:
    // NOTE(zjl): cudaStreamWaitEvent() must be called after all
    // cuda[i].mutable_data() is called, since some ops release
    // cuda memory immediately without waiting cuda kernel ends
    platform::SetDeviceId(
        BOOST_GET_CONST(platform::CUDAPlace, place_).device);
    #ifdef PADDLE_WITH_HIP
    PADDLE_ENFORCE_CUDA_SUCCESS(
        hipEventRecord(events_[i].get(), compute_stream_));
    PADDLE_ENFORCE_CUDA_SUCCESS(
        hipStreamWaitEvent(stream_.get(), events_[i].get(), 0));
    #else
    PADDLE_ENFORCE_CUDA_SUCCESS(
        cudaEventRecord(events_[i].get(), compute_stream_));
    PADDLE_ENFORCE_CUDA_SUCCESS(
        cudaStreamWaitEvent(stream_.get(), events_[i].get(), 0));
    #endif

    platform::RecordEvent record_event("BufferedReader:MemoryCopy");
    for (size_t i = 0; i < cpu.size(); ++i) {
      auto cpu_place = cpu[i].place();
      auto cpu_ptr = cpu[i].data();
      auto gpu_ptr = gpu_ptrs[i];
      auto size =
          cpu[i].numel() * paddle::framework::SizeOfType(cpu[i].type());
      if (platform::is_cuda_pinned_place(cpu_place)) {
        memory::Copy(BOOST_GET_CONST(platform::CUDAPlace, place_), gpu_ptr,
                     BOOST_GET_CONST(platform::CUDAPinnedPlace, cpu_place),
                     cpu_ptr, size, stream_.get());
      } else if (platform::is_gpu_place(cpu_place)) {
        memory::Copy(BOOST_GET_CONST(platform::CUDAPlace, place_), gpu_ptr,
                     BOOST_GET_CONST(platform::CUDAPlace, cpu_place),
                     cpu_ptr, size, stream_.get());
      } else {
        platform::CUDAPinnedPlace cuda_pinned_place;
        framework::LoDTensor cuda_pinned_tensor;
        cuda_pinned_tensor.Resize(cpu[i].dims());
        auto cuda_pinned_ptr = cuda_pinned_tensor.mutable_data(
            cuda_pinned_place, cpu[i].type());
        memory::Copy(cuda_pinned_place, cuda_pinned_ptr,
                     BOOST_GET_CONST(platform::CPUPlace, cpu_place),
                     cpu_ptr, size);
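A rough Python-side sketch of the same record-and-wait pattern follows. It assumes the paddle.device.cuda Stream/Event API that newer Paddle releases expose (not available in 2.1.2), a hypothetical raw-handle accessor on the stream object, and placeholder names (pipe, device_id, data_ptr, label_ptr) carried over from the workaround sketch above.

    import paddle
    from nvidia.dali.plugin.paddle import feed_ndarray

    # Assumed API (post-2.1.2): paddle.device.cuda.Stream / Event / current_stream.
    compute_stream = paddle.device.cuda.current_stream(device_id)
    copy_stream = paddle.device.cuda.Stream(device_id)   # stream the DALI copy runs on

    data, label = pipe.run()
    # ... allocate the Paddle tensors and obtain data_ptr / label_ptr via
    # _mutable_data, as in the workaround sketch above ...

    # cudaEventRecord on the compute stream + cudaStreamWaitEvent on the copy
    # stream: only the copy stream waits, instead of a device-wide synchronize.
    event = paddle.device.cuda.Event()
    event.record(compute_stream)
    copy_stream.wait_event(event)

    # feed_ndarray expects a raw cudaStream_t; `cuda_stream` below is a
    # hypothetical raw-handle accessor (whether and how it is exposed is
    # exactly what the rest of this thread discusses).
    raw_stream = copy_stream.cuda_stream
    feed_ndarray(data.as_tensor(), data_ptr, cuda_stream=raw_stream)
    feed_ndarray(label.as_tensor(), label_ptr, cuda_stream=raw_stream)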
4ngedf3f4#
Hi @zhiqiu,
Thank you for looking into this issue. To reproduce it you don't have to build DALI; the changes from the mentioned PR relate only to the Python part, so you can install the latest DALI and just update the relevant Python files in the location where your wheel is installed (in my case /usr/local/lib/python3.6/dist-packages/nvidia/dali/plugin/).
Just to be sure, do I need to call cudaEventRecord -> cudaStreamWaitEvent, or can I just call feed_ndarray(tensor, ptr, paddle.device.cuda.current_stream(device_id))? This way DALI would do the copy on the same stream the PaddlePaddle ops operate on, so it should be safe.
Another question: when is the Python API that exposes CUDA streams going to be available? (I see it is available only on the development branch.)
1l5u6lss5#
Thanks. I will try your method.
I think the better way is to use two streams and use an event to sync them; in that case, the data-copy stream and the compute stream can overlap to improve performance.
paddle.device.cuda.current_stream(device_id) will be released in Paddle 2.2rc (next month). If you need it now, you can install the nightly-build version. Here is the command for the CUDA 11.2 version:
python -m pip install paddlepaddle-gpu==0.0.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
bvjveswy6#
@zhiqiu,
I'll give it a try. Thanks for the help.
pu82cl6c7#
Hi @zhiqiu,
I tested the method by creating and recording the event. It seems to work fine.
In DALI's case a device-to-device copy is performed, so it cannot overlap with other kernels anyway. However, I cannot do feed_ndarray(tensor, ptr, paddle.device.cuda.current_stream(device_id)), as the current_stream call returns a PaddlePaddle object and it is not possible to extract the raw CUDA stream that DALI expects. I checked the API and was not able to find a way; even the raw_stream method from cuda_stream.h is not exposed in the binary, so I cannot call it indirectly (using ctypes.CDLL). Do you plan to expose the raw pointers of streams and events so other libraries can use them?
s4chpxco8#
OK, I got it. I will find out how to expose the raw pointer of current_stream.
db2dz4w89#
Hi @JanuszL, please refer to: #35813
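For completeness, a sketch of how the single-stream variant asked about above might look once that PR lands. The raw-handle property name (cuda_stream) is an assumption, and pipe, device_id and the destination pointers are placeholders from the earlier sketches.

    import paddle
    from nvidia.dali.plugin.paddle import feed_ndarray

    # Assumption: the referenced PR exposes the raw cudaStream_t of the
    # current compute stream, here via a `cuda_stream` property.
    compute_stream = paddle.device.cuda.current_stream(device_id)
    raw_stream = compute_stream.cuda_stream

    data, label = pipe.run()
    # ... allocate the Paddle tensors and obtain data_ptr / label_ptr ...

    # With the copy issued on Paddle's own compute stream, allocation and copy
    # are ordered on the same stream and no extra event/synchronize is needed.
    feed_ndarray(data.as_tensor(), data_ptr, cuda_stream=raw_stream)
    feed_ndarray(label.as_tensor(), label_ptr, cuda_stream=raw_stream)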
bqujaahr10#
Hi @zhiqiu,
That should do it. I will check the nightly build as soon as it is merged. Thanks a lot.