How to fix "ResourceExhaustedError: OOM when allocating tensor" in Keras

bpsygsoo, posted 2022-11-30 in: Other

I want to build a model with multiple inputs, so I tried to build one like this:

from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.optimizers import Adam

# define two sets of inputs
inputA = Input(shape=(32, 64, 1))
inputB = Input(shape=(32, 1024))

# CNN branch for the image input
x = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(inputA)
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(500, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)

# DNN branch for the embedding input
y = layers.Flatten()(inputB)
y = Dense(64, activation="relu")(y)
y = Dense(250, activation="relu")(y)
y = Dense(500, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# Combine the outputs of the two branches
combined = concatenate([x.output, y.output])

# head on top of the combined features
z = Dense(300, activation="relu")(combined)
z = Dense(100, activation="relu")(z)
z = Dense(1, activation="softmax")(z)

model = Model(inputs=[x.input, y.input], outputs=z)

model.summary()

opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

And the summary: (model.summary() screenshot omitted)
But when I try to train this model,

history = model.fit([trainimage, train_product_embd],train_label,
    validation_data=([validimage,valid_product_embd],valid_label), epochs=10, 
    steps_per_epoch=100, validation_steps=10)

the following problem occurs:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-18-2b79f16d63c0> in <module>()
----> 1 history = model.fit([trainimage, train_product_embd],train_label,
            validation_data=([validimage,valid_product_embd],valid_label),
            epochs=10, steps_per_epoch=100, validation_steps=10)

4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                                self._handle, args,
-> 1472                                                run_metadata_ptr)
   1473         if run_metadata:
   1474             proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node conv2d_1/convolution}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[metrics/acc/Mean_1/_185]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node conv2d_1/convolution}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Thanks for reading, and I hope someone can help me :)


5gfr0r5j1#

  • OOM stands for "out of memory". Your GPU is running out of memory, so it cannot allocate memory for this tensor. There are a few things you can do (a few of them are sketched in code right after this list):
  • Reduce the number of filters in your Dense and Conv2D layers
  • Use a smaller batch_size (or increase steps_per_epoch and validation_steps)
  • Use grayscale images (you can use tf.image.rgb_to_grayscale)
  • Reduce the number of layers
  • Use MaxPooling2D layers after your convolutional layers
  • Reduce the size of your images (you can use tf.image.resize for that)
  • Use smaller float precision for your input, namely np.float32
  • If you are using a pre-trained model, freeze the first layers (like this)
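
A minimal sketch of three of these suggestions (grayscale conversion, resizing, and float32 casting); the images array below is a made-up stand-in, not the asker's data:

import numpy as np
import tensorflow as tf

images = np.random.rand(16, 64, 128, 3)   # hypothetical float64 RGB batch

x = tf.convert_to_tensor(images)
x = tf.image.rgb_to_grayscale(x)          # 3 channels -> 1 channel
x = tf.image.resize(x, (32, 64))          # shrink the spatial dimensions
x = tf.cast(x, tf.float32)                # feed float32, not float64, to the model
print(x.shape, x.dtype)                   # (16, 32, 64, 1) <dtype: 'float32'>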

More useful information about this error:

OOM when allocating tensor with shape[800000,32,30,62]

That is a weird shape. If you are working with images, you should normally have 1 or 3 channels. On top of that, it looks like you are passing your entire dataset at once; you should pass it in batches.
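
A quick back-of-the-envelope calculation (my addition, assuming 4-byte float32 elements) shows why a single allocation of that shape cannot fit on any GPU:

elements = 800_000 * 32 * 30 * 62     # the shape reported in the error message
size_gib = elements * 4 / 1024 ** 3   # 4 bytes per float32 element
print(f"{size_gib:.0f} GiB")          # roughly 177 GiB for one activation tensor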


lvjbypge2#

Judging from [800000,32,30,62], it looks like your model is putting all of the data into a single batch.
Try specifying a batch size, for example:

history = model.fit([trainimage, train_product_embd], train_label,
    validation_data=([validimage, valid_product_embd], valid_label),
    epochs=10, steps_per_epoch=100, validation_steps=10, batch_size=32)

If it still OOMs, try reducing the batch_size.
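
As an alternative sketch (not from the original answer), the two inputs can also be fed in batches through tf.data; trainimage, train_product_embd and the other arrays are the asker's variables:

import tensorflow as tf

# build batched datasets so only one batch at a time reaches the GPU
train_ds = tf.data.Dataset.from_tensor_slices(
    ((trainimage, train_product_embd), train_label)).shuffle(10_000).batch(32)
valid_ds = tf.data.Dataset.from_tensor_slices(
    ((validimage, valid_product_embd), valid_label)).batch(32)

history = model.fit(train_ds, validation_data=valid_ds, epochs=10)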


iibxawm43#

Happened to me as well.
You can try to reduce the number of trainable parameters by using some form of transfer learning: try freezing the first few layers and use a lower batch size.
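
A minimal sketch of that idea, assuming a VGG16 backbone and a small binary-classification head (both are my assumptions, not something the answer specifies):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# pre-trained backbone; everything except the last few layers is frozen
base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
for layer in base.layers[:-4]:
    layer.trainable = False           # frozen layers keep their pre-trained weights

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Fewer trainable parameters means less optimizer and gradient state on the GPU, which, combined with a lower batch size, keeps memory use down.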


jv4diomz4#

I think the most common cause of this situation is the lack of MaxPool layers. Keep the same architecture, but add at least one MaxPool layer after the Conv2D layers; this may even improve the overall performance of the model. You can also try reducing the depth of the model, i.e. removing unnecessary layers and optimizing from there.
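
A sketch of that suggestion applied to the CNN branch from the question, with a MaxPooling2D added after each Conv2D:

from tensorflow.keras import layers, Input, Model

inputA = Input(shape=(32, 64, 1))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputA)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # added: halves height and width
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # added: shrinks the activations again
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation='relu')(x)
cnn_branch = Model(inputs=inputA, outputs=x)
cnn_branch.summary()

Each pooling layer halves both spatial dimensions, so the activation maps that have to be kept in GPU memory during training shrink to a quarter of their size.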
