How can I programmatically determine the available GPU memory with TensorFlow?

11dmarpk · posted on 2022-11-30 in: Other

For a vector quantization (k-means) program I would like to know how much memory is currently available on the GPU (if there is one). I need this to choose an optimal batch size, so that the whole data set can be processed in as few batches as possible.
I wrote the following test program:

import tensorflow as tf
import numpy as np
from kmeanstf import KMeansTF
print("GPU Available: ", tf.test.is_gpu_available())

nn = 1000
dd = 250000
print("{:,d} bytes".format(nn*dd*4))  # size of one (nn, dd) float32 tensor
dic = {}
for x in "ABCD":
    # allocate one ~1 GB tensor per iteration and keep a reference to it
    dic[x] = tf.random.normal((nn, dd))
    print(x, dic[x][:1, :2])

print("done...")

This is typical output on my system (Ubuntu 18.04 LTS, GTX-1060 6 GB). Note the core dump at the end.

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.23787294 -2.0841186 ]], shape=(1, 2), dtype=float32)
B tf.Tensor([[ 0.23762687 -1.1229591 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[-1.2672468   0.92139906]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:05.988473: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:05.988752: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:05.988835: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Segmentation fault (core dumped)

Sometimes I get a Python exception instead of a core dump (see below). That would actually be better, because I could catch it and thus determine the maximum available memory by trial and error. But it alternates with core dumps:

python misc/maxmem.py 
GPU Available:  True
1,000,000,000 bytes
A tf.Tensor([[-0.73510283 -0.94611156]], shape=(1, 2), dtype=float32)
B tf.Tensor([[-0.8458411  0.552555 ]], shape=(1, 2), dtype=float32)
C tf.Tensor([[0.30532074 0.266423  ]], shape=(1, 2), dtype=float32)
2020-01-02 17:35:26.401156: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000).  Current allocation summary follows.
2020-01-02 17:35:26.401486: W tensorflow/core/common_runtime/bfc_allocator.cc:424] **************************************************************************************************xx
2020-01-02 17:35:26.401571: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "misc/maxmem.py", line 11, in <module>
    dic[x]=tf.random.normal((nn,dd))
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/random_ops.py", line 76, in random_normal
    value = math_ops.add(mul, mean_tensor, name=name)
  File "/home/fritzke/miniconda2/envs/tf20b/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 391, in add
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1000,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] name: random_normal/

How can I reliably obtain this information, no matter what system the software is running on?


qyyhg6bp1#

I actually found the answer in this old question of mine.

import nvidia_smi

nvidia_smi.nvmlInit()

handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
# card id 0 hardcoded here, there is also a call to get all available card ids, so we could iterate

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)
print("Free memory:", info.free)
print("Used memory:", info.used)

nvidia_smi.nvmlShutdown()

This is the result:

Total memory: 17071734784
Free memory: 17071734784
Used memory: 0
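
If there is more than one card, here is a small sketch of the iteration hinted at in the code comment above (assuming these nvidia_smi bindings also expose nvmlDeviceGetCount(), as pynvml does):

import nvidia_smi

nvidia_smi.nvmlInit()
# enumerate all cards instead of hard-coding index 0
device_count = nvidia_smi.nvmlDeviceGetCount()
for i in range(device_count):
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print("GPU {}: free {:,d} / total {:,d} bytes".format(i, info.free, info.total))
nvidia_smi.nvmlShutdown()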

The GPU I actually have is a Tesla P100, which can be seen by executing

!nvidia-smi

and looking at the output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

7d7tgy0s2#

This code returns the free GPU memory (in MB) for every GPU:

import subprocess as sp

def get_gpu_memory():
    # ask nvidia-smi for the free memory of every GPU and skip the CSV header line
    command = "nvidia-smi --query-gpu=memory.free --format=csv"
    memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
    memory_free_values = [int(x.split()[0]) for x in memory_free_info]
    return memory_free_values

get_gpu_memory()

This answer relies on nvidia-smi being installed (which is essentially always the case for NVIDIA GPUs) and is therefore limited to NVIDIA GPUs.
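
Tying this back to the original question, here is a minimal sketch of turning the reported free memory into a batch size, using the get_gpu_memory() helper above (pick_batch_size and the 0.9 safety margin are hypothetical, assuming one row of dd float32 values per sample and the first GPU):

def pick_batch_size(dd, safety=0.9):
    # free memory of GPU 0, reported in MiB, converted to bytes
    free_bytes = get_gpu_memory()[0] * 1024 * 1024
    # one sample is a row of dd float32 values (4 bytes each)
    bytes_per_sample = dd * 4
    return int(free_bytes * safety) // bytes_per_sample

batch_size = pick_batch_size(dd=250000)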


14ifxucb3#

If you are using tensorflow-gpu==2.5, you can use

tf.config.experimental.get_memory_info('GPU:0')

to get the GPU memory actually consumed by TF. nvidia-smi tells you nothing useful here, because by default TF allocates all of the memory for itself, leaving nvidia-smi no way to report how much of that pre-allocated memory is actually in use.


zrfyljdw4#

To sum up, the best solution that works well is to use tf.config.experimental.get_memory_info('DEVICE_NAME').
This function returns a dictionary with two keys:

  • 'current': the memory currently used by the device, in bytes
  • 'peak': the peak memory used by the device during the run of the program, in bytes.

The values of these keys are the memory actually used, not the allocated memory that nvidia-smi reports.
In fact, for GPUs TensorFlow allocates all of the memory by default, which makes checking the memory used by your code with nvidia-smi useless. Even when tf.config.experimental.set_memory_growth is set to true, TensorFlow no longer grabs the whole available memory, but it still allocates more memory than is actually used, and in discrete steps, e.g. first 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, and so on.
A small caveat about get_memory_info() is that it does not return correct values when called inside a tf.function()-decorated function. So the 'peak' key should be read after the tf.function()-decorated function has executed, to determine the peak memory that was used.
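A minimal sketch of that pattern (assuming TensorFlow 2.5+ and a visible device named 'GPU:0'; big_matmul is just a hypothetical workload for illustration):

import tensorflow as tf

@tf.function
def big_matmul(a, b):  # hypothetical workload, only for illustration
    return tf.matmul(a, b)

a = tf.random.normal((4096, 4096))
b = tf.random.normal((4096, 4096))
_ = big_matmul(a, b)  # run the decorated function first

# query the stats only after execution, as noted above
info = tf.config.experimental.get_memory_info('GPU:0')
print("current:", info['current'], "bytes")
print("peak:   ", info['peak'], "bytes")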
For older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it only returned the used memory (with no option for determining the peak memory).
As a final note, you can also consider the TensorFlow Profiler, which ships with TensorBoard, to get information about your memory usage.
Hope this helps :)
