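Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

tf 2.7.0

Custom code

Yes

OS platform and distribution

Linux Ubuntu 18.04

Mobile device

No response

Python version

3.7.9

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.2

GPU model and memory

No response

Current behavior?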

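Our project uses tf.compat.v1.train.MonitoredTrainingSession to create a training session. We usually need to restore a pretrained model from S3.

1. Error encountered in my project

Before switching to TensorFlow 2, we used TensorFlow 1.15.1 and passed the S3 path to checkpoint_dir, as follows: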

import tensorflow as tf
.....
checkpoint_dir = "s3://xxx/xx/"
tf.compat.v1.train.MonitoredTrainingSession(...., checkpoint_dir=checkpoint_dir, ...)

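checkpoint_dir contains everything needed to restore the variables, including the checkpoints, graph.pbtxt, and so on. Everything worked fine.
After switching to TensorFlow 2.7.0, we realized that TensorFlow had introduced modular file systems. We therefore installed tensorflow-io 0.23.0, the version compatible with TensorFlow 2.7.0, and the code became: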

import tensorflow as tf
import tensorflow_io as tfio
.....
checkpoint_dir = "s3://xxx/xx/"
tf.compat.v1.train.MonitoredTrainingSession(...., checkpoint_dir=checkpoint_dir, ...)

However, it no longer works and reports an error:

.....
2023-08-02 16:02:40.147093: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:207 : DATA_LOSS: truncated block read
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1380, in _do_call
    return fn(*args)
  File "/root/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1364, in _run_fn
    target_list, run_metadata)
  File "/root/miniconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1458, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: 2 root error(s) found.
  (0) DATA_LOSS: truncated block read
         [[{{node save/RestoreV2}}]]
         [[save/RestoreV2/_1]]
  (1) DATA_LOSS: truncated block read
         [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.
.....
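
Since the restore depends on what checkpoint_dir actually contains, one quick check is whether the pieces of a V2-format checkpoint (an .index file plus one or more .data-* shards) are all visible at that path. The sketch below is an illustration, not part of the original report. latest_prefix and missing_pieces are hypothetical helper names, and the file names used in the comments are assumed from the default MonitoredTrainingSession naming (model.ckpt-<step>).

```python
import re


def latest_prefix(state_text):
    """Return the model_checkpoint_path recorded in a `checkpoint` state file, or None."""
    match = re.search(r'^model_checkpoint_path:\s*"([^"]*)"', state_text, re.MULTILINE)
    return match.group(1) if match else None


def missing_pieces(prefix, names):
    """List the V2 checkpoint pieces for `prefix` that are absent from `names`."""
    base = prefix.rsplit("/", 1)[-1]  # the state file may store a path; keep the file stem
    missing = []
    if base + ".index" not in names:
        missing.append(base + ".index")
    if not any(name.startswith(base + ".data-") for name in names):
        missing.append(base + ".data-*")
    return missing


# Assumed example input: the text of a `checkpoint` state file and a directory listing.
state = 'model_checkpoint_path: "model.ckpt-500"\nall_model_checkpoint_paths: "model.ckpt-500"\n'
listing = ["checkpoint", "graph.pbtxt", "model.ckpt-500.index"]
print(missing_pieces(latest_prefix(state), listing))
```

Assuming tensorflow_io has been imported so that the s3:// scheme is available (as in the report), the state text could be read with tf.io.gfile.GFile and the listing obtained with tf.io.gfile.listdir. Note that a complete listing does not rule out the truncated-read error above; it only excludes missing files as a cause.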

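2. Reproducing the problem with simple code

To rule out the possibility that the problem is caused by the complexity of the model in my project, I reproduced it with very simple code.

2.1 Step 1: Train the model

First, I train a very simple model with the following code and save it to a local directory: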

import tensorflow as tf

tf.compat.v1.disable_eager_execution()
x = tf.compat.v1.placeholder(tf.float32, shape=(None, 1), name="x")
y = tf.compat.v1.placeholder(tf.float32, shape=(None, 1), name="y")

W = tf.Variable(tf.zeros([1, 1]), name="W")
b = tf.Variable(tf.zeros([1]), name="b")

y_pred = tf.matmul(x, W) + b
loss = tf.reduce_mean(tf.square(y - y_pred))

optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)

global_step = tf.compat.v1.train.get_or_create_global_step()
train_op = optimizer.minimize(loss, global_step=global_step)

x_train = [[1], [2], [3], [4]]
y_train = [[0], [-1], [-2], [-3]]

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
hooks = [tf.compat.v1.train.StopAtStepHook(last_step=500)]

checkpoint_dir = './checkpoints'

with tf.compat.v1.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                                 config=config,
                                                 hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op, feed_dict={x: x_train, y: y_train})

2.2 Step 2: Upload the model to S3

Then I use an S3 tool to upload everything in ./checkpoints to a remote S3 path:

s3cmd put ./checkpoints/ s3://xxxx/xxx/checkpoints/
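
Before restoring, it may be worth confirming that the files TensorFlow sees on S3 have the same sizes as the local ones. The following is a minimal sketch under that assumption (matching sizes are only a first signal, not proof of identical content). dir_sizes, size_mismatches, and remote_sizes are hypothetical helpers, not part of the original report; remote_sizes assumes tensorflow and tensorflow_io are installed and uses tf.io.gfile.listdir and tf.io.gfile.stat.

```python
import os


def dir_sizes(directory, listdir=os.listdir, size_of=os.path.getsize):
    """Map each entry name in `directory` to its size in bytes."""
    base = directory.rstrip("/")
    sizes = {}
    for name in listdir(base):
        name = name.rstrip("/")  # some file systems list directories with a trailing slash
        sizes[name] = size_of(base + "/" + name)
    return sizes


def size_mismatches(local, remote):
    """Return {name: (local_size, remote_size)} for entries that differ or are missing."""
    names = sorted(set(local) | set(remote))
    return {n: (local.get(n), remote.get(n)) for n in names if local.get(n) != remote.get(n)}


def remote_sizes(directory):
    """Same as dir_sizes, but reading through TensorFlow's file system layer."""
    import tensorflow as tf
    import tensorflow_io  # noqa: F401  (imported for its side effect of enabling s3://)

    return dir_sizes(directory, tf.io.gfile.listdir, lambda path: tf.io.gfile.stat(path).length)
```

For example, size_mismatches(dir_sizes("./checkpoints"), remote_sizes("s3://xxxx/xxx/checkpoints/")) returns an empty dict when every file is present on both sides with the same size.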

2.3 Step 3: Restore the model from S3 (error)

Finally, I use the following code to resume training from the checkpoint, and it reports an error:

import tensorflow as tf
import tensorflow_io as tfio

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=(None, 1), name="x")
y = tf.compat.v1.placeholder(tf.float32, shape=(None, 1), name="y")

W = tf.Variable(tf.zeros([1, 1]), name="W")
b = tf.Variable(tf.zeros([1]), name="b")

y_pred = tf.matmul(x, W) + b
loss = tf.reduce_mean(tf.square(y - y_pred))

optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)

global_step = tf.compat.v1.train.get_or_create_global_step()

train_op = optimizer.minimize(loss, global_step=global_step)

x_train = [[1], [2], [3], [4]]
y_train = [[0], [-1], [-2], [-3]]

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True

checkpoint_dir = 's3://xxxx/xxx/checkpoints/'

hooks = [tf.compat.v1.train.StopAtStepHook(last_step=2000)]
with tf.compat.v1.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir, config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op, feed_dict={x: x_train, y: y_train})
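
To tell whether the failure comes from reading the checkpoint files or from the session itself, one option is to read the tensors directly, without MonitoredTrainingSession. This is a sketch, not a confirmed diagnosis. probe_checkpoint and probe_with_tensorflow are hypothetical helper names; the latter assumes tensorflow and tensorflow_io are installed and relies on tf.train.latest_checkpoint and tf.train.load_checkpoint.

```python
def probe_checkpoint(load_checkpoint, prefix):
    """Try to read every tensor under `prefix`; return {name: "ok" or error text}."""
    try:
        reader = load_checkpoint(prefix)
    except Exception as exc:  # opening the index file already failed
        return {"<open>": f"{type(exc).__name__}: {exc}"}
    report = {}
    for name in sorted(reader.get_variable_to_shape_map()):
        try:
            reader.get_tensor(name)
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


def probe_with_tensorflow(checkpoint_dir):
    """Run the probe against the latest checkpoint in `checkpoint_dir`."""
    import tensorflow as tf
    import tensorflow_io  # noqa: F401  (imported for its side effect of enabling s3://)

    prefix = tf.train.latest_checkpoint(checkpoint_dir)
    if prefix is None:
        return {"<open>": "no checkpoint found"}
    return probe_checkpoint(tf.train.load_checkpoint, prefix)
```

Running probe_with_tensorflow on both './checkpoints' and the S3 path and comparing the two reports would show whether the same tensors are readable from both locations.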

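@bingo163, the tf.compat.v1.train.MonitoredTrainingSession API was designed for TensorFlow v1. Continue reading for details on how to migrate from this API to a native TensorFlow v2 equivalent. See the TensorFlow v1 to TensorFlow v2 migration guide for instructions on how to migrate the rest of your code.
Also, is there any specific reason for using the tf.compat.v1.train.MonitoredTrainingSession API with TensorFlow v2.7? I request you to upgrade to the latest stable version, 2.13. Thank you!
@tilakrayal
Thanks for the quick reply.
Our project still uses the tf.compat.v1.train.MonitoredTrainingSession API and TF 2.7 because changing the API usage or upgrading the TF version would require re-validating many things, including model convergence. We would therefore like to find the cause of this error while changing the API usage and the TF version as little as possible.
In addition, following your suggestion, I upgraded TF and tensorflow-io in the test environment to 2.13.0 and 0.33.0, but the same error still occurs.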

8 Answers



lqfhib0f2#