System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I am using the RoBERTa PyTorch-XLA example from https://cloud.google.com/tpu/docs/tutorials/roberta-pytorch
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): debian-9-torch-xla-v20201225 (GCP image)
- GCP machine type: custom, 8 vCPUs, 256 GB RAM
- TensorFlow installed from (source or binary): provided on the GCP image
- TensorFlow version (use command below): torch-xla-1.7
- Python version: Python 3.6.10 :: Anaconda, Inc.
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: TPU v3-8
Describe the current behavior
After training for a while with PyTorch-XLA, the following error occurred:
2020-12-28 01:56:05.252085: W 1417 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251970000","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
*** Begin stack trace ***
tensorflow::CurrentStackTrace()
xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::string const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
xla::XrtComputationClient::HandleReleaser()
xla::util::TriggeredTask::Runner()
clone
*** End stack trace ***
I followed the steps described there, with the same network parameters; only the dataset is different.
I have run into this before, but that time it was caused by an OOM on the VM after restoring a checkpoint, which is why I increased the VM memory.
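In case it is useful for anyone reproducing this, a memory-friendlier way to restore a checkpoint is to deserialize it on the CPU first and only move the weights to the XLA device afterwards. The sketch below is generic PyTorch code with an assumed "model" key, not the actual fairseq checkpoint handling:

```python
# Generic sketch (assumed checkpoint layout, not the fairseq/tutorial code):
# deserialize on the CPU first, then move the weights to the XLA device and
# drop the CPU copy, which keeps the peak host memory during restore lower.
import torch
import torch_xla.core.xla_model as xm

def restore_model(model, path):
    # map_location="cpu" avoids materializing extra copies during load.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])  # "model" key is an assumption
    model.to(xm.xla_device())
    del state  # free the CPU copy before training resumes
    return model
```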
It seems the TPU was somehow preempted, but I have no access to the runtime logs, because the error happened overnight and TFRC automatically deleted it.
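For future runs, the node state can at least be polled so that a preemption is recorded somewhere before the node disappears. A rough sketch using the gcloud CLI (the node name and zone below are placeholders):

```python
# Rough sketch: query the TPU node's state/health via the gcloud CLI.
# "my-tpu-node" and "europe-west4-a" are placeholders, not the real node.
import json
import subprocess

def tpu_state(node, zone):
    out = subprocess.run(
        ["gcloud", "compute", "tpus", "describe", node,
         "--zone", zone, "--format", "json"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    )
    info = json.loads(out.stdout)
    # The node API reports states such as READY or PREEMPTED, plus a
    # separate health field (e.g. HEALTHY).
    return info.get("state"), info.get("health")

if __name__ == "__main__":
    print(tpu_state("my-tpu-node", "europe-west4-a"))
```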
Describe the expected behavior
Training should continue as expected.
Standalone code to reproduce the issue
https://cloud.google.com/tpu/docs/tutorials/roberta-pytorch
The training data is about 40 GB.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
| epoch 002 | training on xla:0/1: 4151 / 10099 loss=1.804, nll_loss=1.804, wps=17811, ups=0, wpb=117675.783, bsz=296.068, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27948, train_wall=92610, now=01:54:20
| epoch 002 | training on xla:0/7: 4151 / 10099 loss=1.805, nll_loss=1.805, wps=17811, ups=0, wpb=117678.381, bsz=296.074, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27948, train_wall=92609, now=01:54:20
| epoch 002 | training on xla:0/3: 4151 / 10099 loss=1.805, nll_loss=1.805, wps=17810, ups=0, wpb=117668.251, bsz=296.137, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27949, train_wall=92608, now=01:54:20
2020-12-28 01:56:05.252059: W 1436 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251908254","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252085: W 1417 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251970000","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252085: W 1416 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251940620","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252162: W 1379 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252025037","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252205: W 1438 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252117134","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252251: W 1465 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252143973","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252279: W 1483 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252130522","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252398: W 1464 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252301762","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252431: W 1428 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252333413","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252452: W 1341 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252361631","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252472: W 1400 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252380523","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252456: W 1345 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252299434","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252541: W 1378 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252405772","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252553: W 1423 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252493206","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252618: W 1397 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252489431","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252674: W 1480 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252561700","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
terminate called after throwing an instance of 'std::runtime_error'
what(): tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1110 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Aborted: Session a57840b79b1bd972 is not found. vs. OK)
*** Begin stack trace ***
tensorflow::CurrentStackTrace()
xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::string const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
xla::XrtComputationClient::HandleReleaser()
xla::util::TriggeredTask::Runner()
clone
*** End stack trace ***
For now, I have resumed training on another TPU node, but looking at memory usage, it seems to grow with every training step. Could the TPU have run out of memory and become unavailable as a result?
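A minimal way to track the growth per step would be to log host RSS via psutil together with torch_xla's client-side metrics report; the sketch below is an addition for illustration, not part of the tutorial script:

```python
# Illustrative sketch (not part of the tutorial script): log host RSS and the
# XLA client metrics every `every` steps to see what is actually growing.
import os

import psutil  # third-party, host-side memory stats
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

def log_memory(step, every=100):
    if step % every != 0:
        return
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    xm.master_print("step %d: host RSS %.2f GiB" % (step, rss_gb))
    # Counters that keep rising step after step (e.g. compile or transfer
    # counters) usually point at recompilation or tensors kept alive
    # across steps rather than a leak on the TPU itself.
    xm.master_print(met.metrics_report())
```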
2 Answers
mi7gmzs61#
Hi,
Thank you for opening this issue. Since it has been open for a long time, the code and debug information here may no longer be relevant to the current state of the code base.
The TensorFlow team is continuously improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which may resolve the issue. If you are still facing the problem, please open a new GitHub issue with your latest findings and all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments in the TensorFlow space.
6pp0gazn2#
Hi soares-f, I am running into the same problem (although with a custom model), with a similar increase in TPU memory every epoch.
Did you ever figure out what was causing it?