Issue type
Bug
Source
binary
TensorFlow version
2.10 (2.9 also tested)
Custom code
No
OS platform and distribution
Tested on Ubuntu Linux 22.04 and Google Colab
Mobile device
No response
Python version
3.9
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
11.8
GPU model and memory
No response
Current behaviour?
When a dataset is generated via `Dataset.from_generator`, samples that have already been used are not removed from GPU memory (or at least this is what I presume), leading to increasing VRAM usage and, eventually, an OOM error as training proceeds.
This is an important bug, since `from_generator` is very often used (as in this case) to perform augmentation etc., and it can easily be seen by examining `GPU_mem_usage` that memory usage does indeed grow. For some reason Colab allocates ~8 GB straight away, so the growth only becomes noticeable after around 300-500 epochs. The problem is far worse when training on real, larger datasets.
For a quick summary, just view the graphs at the end of the notebook.
Notice (in the notebook) that even though the total memory taken on the GPU grows, TensorFlow "thinks" it is consuming an essentially constant amount of memory, which points towards a memory leak.
Also, it takes around 900 epochs to fill almost the whole memory (16 GB in the case of Colab). Each epoch consists of 1024 images of shape (64, 64, 3) with float32 dtype, giving 1024*3*64*64*900*4 bytes allocated in total, which is around 45 GB, so I presume that either not all of the memory is leaked, or leaking data is not the cause of the problem.
Notice, however, that MobileNetV3 (which I have used in this example) is roughly 18 MB in size, and 18 MB * 900 epochs is basically the aforementioned 16 GB. This could mean that the model's state is somehow leaked, but I have yet to test this hypothesis (for example by using a larger model and checking when the leak happens, i.e. whether `num_epochs_until_crash*model_size==gpu_vram_capacity`).
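A quick back-of-the-envelope check of the two estimates above (pure arithmetic on the numbers quoted in this report, no TensorFlow involved):

```python
# Data streamed through the pipeline: 1024 images of (64, 64, 3) float32 per epoch.
bytes_per_epoch = 1024 * 64 * 64 * 3 * 4           # ~50 MB per epoch
data_total = bytes_per_epoch * 900                  # over 900 epochs
print(f"data streamed over 900 epochs: {data_total / 1e9:.1f} GB")   # ~45 GB

# Model-sized leak hypothesis: ~18 MB (MobileNetV3) leaked once per epoch.
model_total = 18e6 * 900
print(f"18 MB * 900 epochs: {model_total / 1e9:.1f} GB")             # ~16 GB
```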
Also, I have tested multiple scenarios locally, and this does NOT seem to happen without `os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'` set. However, I have no relevant logs to prove it, since when the flag is absent the whole GPU memory is allocated up front and I have no meaningful way of monitoring it, because the reported usage never changes. I have, however, performed 4 tests without the flag and none resulted in an OOM error after 1500 epochs, whereas with the flag set they would usually throw an OOM after about 400 epochs locally (RTX 3060 Ti).
Standalone code to reproduce the issue
I have provided a minimal example on Google Colab; it can be viewed here:
https://colab.research.google.com/drive/1-ANVp8KF9irKvqdR390QNlop-pPhGMrU?usp=sharing
I think the example is self-explanatory.
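Since the notebook is only linked, here is a minimal sketch of the kind of setup described above, reconstructed from the description in this report; the generator, shapes and exact model variant are assumptions, not the notebook's code.

```python
import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # set before TF touches the GPU

import numpy as np
import tensorflow as tf

def augmenting_generator():
    # Hypothetical stand-in for an augmentation pipeline:
    # yields 1024 random (64, 64, 3) float32 images per epoch.
    for _ in range(1024):
        yield (np.random.rand(64, 64, 3).astype(np.float32),
               np.int32(np.random.randint(0, 10)))

dataset = tf.data.Dataset.from_generator(
    augmenting_generator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(64)

model = tf.keras.applications.MobileNetV3Small(
    input_shape=(64, 64, 3), weights=None, classes=10)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# According to the report, GPU memory seen by nvidia-smi keeps growing epoch
# after epoch until an OOM is raised (around epoch 800-900 on a 16 GB GPU).
model.fit(dataset, epochs=1500)
```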
Relevant log output
No response
6 Answers
7xzttuei1#
I was able to reproduce the issue in Google Colab, but since training takes a long time, Colab stopped at epoch 858. The expected behaviour was nevertheless captured in the logs. Please find it in the gist attached here.
iyr7buue2#
Hi @Szustarol,
By default, TensorFlow maps nearly all of the GPU memory of all GPUs visible to the process (subject to CUDA_VISIBLE_DEVICES). Hence I tried creating logical devices to check the memory TF actually uses.
From the documentation of the `get_memory_info` API, the returned dict only specifies the current and peak memory that TensorFlow is actually consuming, not the memory that TensorFlow has allocated on the GPU. I therefore used `tf.config.experimental.get_memory_info('GPU:0')['current']` to track the memory currently consumed by the TF process. The nvidia memory you derived for the variable `smi_ret` is probably the cumulative memory used to store model configuration, checkpoints, logs etc., which keeps growing with the number of iterations. Please refer to the attached gist for my case. In your case, since memory growth was not restricted and `'peak'` was used to track memory allocation, TF allocated all of the memory on the GPU, and the memory value of `smi_ret` only started to increase after epoch 856. I hope this helps you understand the issue, even though it is not an elegant solution. Please also see other developers' comments on similar issues, comment1 and comment2, which may be helpful for understanding this case.
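For reference, a sketch of the kind of per-epoch monitoring being discussed, putting TF's own view from `get_memory_info` next to the driver's view from nvidia-smi; the callback below is an illustration, not code from either notebook or gist.

```python
import subprocess
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Logs TF's own view of GPU memory next to nvidia-smi's view after each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        # Memory TensorFlow believes it is currently using / has peaked at (bytes).
        info = tf.config.experimental.get_memory_info('GPU:0')
        # Memory actually occupied on the device according to the driver (MiB).
        smi = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
            text=True).strip()
        print(f"epoch {epoch}: tf current={info['current'] / 1e6:.0f} MB, "
              f"tf peak={info['peak'] / 1e6:.0f} MB, nvidia-smi used={smi} MiB")

# Usage: model.fit(dataset, epochs=..., callbacks=[MemoryLogger()])
```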
zu0ti5jz3#
Hi @SuryanarayanaY,
I did measure the 'current' memory usage, but I was not sure that this is really where the problem lies, so I chose to monitor the peak value, which should better show that TensorFlow actually believes it has allocated less memory than it really has. The problem I found is that, as training proceeds, the cumulative memory allocated on the GPU grows beyond 8 GB and eventually causes an OOM error after roughly 800 epochs, which should not happen and which (according to my tests) does not happen when the `cuda_malloc_async` flag is not set. I may be wrong, but to me this is a memory leak that, under certain circumstances (with the flag set), leads to OOM.
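For context, the switch being discussed is a single environment variable; as far as I know it only takes effect if it is set before TensorFlow initialises the GPU, roughly like this:

```python
import os
# Select the CUDA asynchronous allocator; when this is unset, TF falls back to its
# default BFC allocator, which maps (nearly) all GPU memory up front.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf  # the import (and any GPU op) must come after the flag is set
```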
What I am trying to say is this: to illustrate the problem further, I have provided another Colab notebook:
https://colab.research.google.com/drive/1R1w3HVb3JIqNftF_FsnBT6OwvWsYwNmW?usp=sharing
The only difference between this notebook and the previous one is that the `cuda_malloc_async` flag is not set. You can now see that even after 1500 epochs of training the memory taken on the GPU never exceeds roughly 8.5 GB, while, as you can confirm, around 900 epochs are enough to crash the `malloc_async` version.
ubby3x7f4#
Requesting sachinprasadhs to please take a look at this issue.
kcrjzv8t5#
Are there any updates on this issue?
p5fdfcr16#
If the problem does not occur when `cuda_malloc_async` is not set, that makes the bug look like it is related to `cuda_malloc_async` rather than to `Dataset.from_generator`. Someone familiar with GPU memory allocation needs to investigate this.
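A possible way to narrow this down (not something tested in this thread) would be to keep `cuda_malloc_async` set but feed identical data without `Dataset.from_generator`, e.g. via `from_tensor_slices`; if the memory reported by nvidia-smi still creeps up, the generator path is exonerated.

```python
import numpy as np
import tensorflow as tf

# Same shapes as in the report, but materialised up front so that
# Dataset.from_generator is taken out of the equation entirely.
images = np.random.rand(1024, 64, 64, 3).astype(np.float32)
labels = np.random.randint(0, 10, size=(1024,))

dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(64)
# ... then train exactly as before (same model, same allocator flag) and watch
# whether the memory reported by nvidia-smi still grows epoch after epoch.
```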