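Paddle Cloud job: I asked the Cloud team and they told me to ask here.
job-0bb5fc9ec1c4c160
http://10.127.23.139:8388/v1/containers/8de1dba94ed3e6c2acf2ab51e085575437b2b9d37a3eeef55a1777da6846b079/backuplog
This is a single-machine, multi-GPU job. The input TFRecord files total more than 100 GB. Training ran without problems until a little over 200,000 steps, when it failed with the error below. Is the format of the training files wrong?
If it is a format problem, can the malformed data be skipped?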
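
One way to check the format question offline is to scan a shard's record framing without TensorFlow. The sketch below assumes the standard TFRecord layout (8-byte little-endian length, 4-byte masked CRC-32C of the length, payload, 4-byte masked CRC-32C of the payload). The names `scan_tfrecord` and `check_data_crc` are made up for illustration and are not part of any library.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for _i in range(256):
    _c = _i
    for _ in range(8):
        _c = (_c >> 1) ^ 0x82F63B78 if _c & 1 else _c >> 1
    _CRC32C_TABLE.append(_c)


def crc32c(data):
    """Plain CRC-32C of a bytes object (slow, pure Python)."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """CRC-32C with the rotate-and-add mask used in TFRecord files."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def scan_tfrecord(path, check_data_crc=False):
    """Return (good_record_count, error_message_or_None) for one file.

    Hypothetical helper: stops at the first framing problem it finds.
    OS-level read failures (OSError) are not caught here and propagate.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return count, None  # clean end of file
            if len(header) < 12:
                return count, "truncated header after record %d" % count
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc32c(header[:8]) != length_crc:
                return count, "bad length CRC after record %d" % count
            body = f.read(length + 4)
            if len(body) < length + 4:
                return count, "truncated record after record %d" % count
            if check_data_crc:
                # Checking every payload is slow in pure Python.
                (data_crc,) = struct.unpack("<I", body[length:])
                if masked_crc32c(body[:length]) != data_crc:
                    return count, "bad data CRC after record %d" % count
            count += 1
```

Calling `scan_tfrecord` on each shard with the default `check_data_crc=False` only touches the 12-byte headers' checksums, which keeps a pass over a large dataset feasible. On the skipping question: TensorFlow does provide `tf.data.experimental.ignore_errors()`, which drops elements that raise an error, but whether it helps with a read failure reported by the filesystem, as opposed to a bad record, would need to be tested.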

ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Unknown: afs/input/slot_classify/tf_record.cuid_txt.1207; Input/output error
node IteratorGetNext (defined at train_cuid_item.py:1073)
replica_4/gradients/replica_4/bert/encoder/layer_7/attention/self/query/BiasAdd_grad/BiasAddGrad/_14621
(1) Unknown: afs/input/slot_classify/tf_record.cuid_txt.1207; Input/output error
node IteratorGetNext (defined at train_cuid_item.py:1073)
0 successful operations.
7 derived errors ignored.

Original stack trace for 'IteratorGetNext':
File "train_cuid_item.py", line 1174, in <module>
tf.app.run()
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train_cuid_item.py", line 1073, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_default
input_fn, ModeKeys.TRAIN))
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1022, in _get_features_and_labels_from_input_fn
self._call_input_fn(input_fn, mode))
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/util.py", line 65, in parse_input_fn_result
result = iterator.get_next()
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 426, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1947, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
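
The message in the log, "Input/output error", is the usual system text for errno EIO on Linux, which hints at a failed read on the mounted path rather than a malformed record. A rough way to test that outside TensorFlow is to read the shard named in the error from start to end with plain file I/O. The sketch below does that; `probe_read` and `describe` are hypothetical helper names, and the chunk size is an arbitrary choice.

```python
import errno
import os


def probe_read(path, chunk_size=8 * 1024 * 1024):
    """Read `path` sequentially and report how far the read got.

    Returns (bytes_read, None) on success, or (bytes_read, exc) where
    exc is the OSError raised while opening or reading the file.
    """
    total = 0
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    return total, None
                total += len(chunk)
    except OSError as exc:
        return total, exc


def describe(result):
    """Turn a probe_read result into a one-line summary."""
    total, exc = result
    if exc is None:
        return "ok: read %d bytes" % total
    if exc.errno == errno.EIO:
        # EIO is reported by the OS or filesystem, not by the file format.
        return "EIO after %d bytes: %s" % (total, os.strerror(errno.EIO))
    return "failed after %d bytes: %s" % (total, exc)
```

For example, `print(describe(probe_read("afs/input/slot_classify/tf_record.cuid_txt.1207")))` run inside the same container would show whether the file can be read in full. If the read only fails intermittently, the storage mount is the more likely culprit than the data.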

2 answers

xpszyzbs1#

Hi! We have received your issue and will arrange for a technician to respond as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment and version information, and the error message. In the meantime, you can also consult the official API documentation, the FAQ, past issues, and the AI community for answers. Have a nice day!

2021-12-07

flmtquvp2#

This looks like a TensorFlow error, doesn't it? It doesn't seem that any Paddle-related modules are being used.

Paddle reading tf_record files, error after training for 200,000 steps: Original stack trace for 'IteratorGetNext'
