Paddle: model training reports a gRPC error after running for a while

Posted by ifmq2ha2 on 2021-11-29 in Java

After training for a while, the model fails with a gRPC error, and the failure reproduces consistently.

2019-06-16 16:36:14,803 - INFO - TRAIN --> pass: 0 batch: 100 loss: 0.61961340332 avg_loss :0.646084533691 avg_loss_100 :0.646084533691 auc: 0.525004341001, auc_100: 0.519851564739 ,batch_auc: 0.565066064668
2019-06-16 17:28:30,594 - INFO - TRAIN --> pass: 0 batch: 200 loss: 0.643115722656 avg_loss :0.633750671387 avg_loss_100 :0.621293518066 auc: 0.564058411824, auc_100: 0.595861241349 ,batch_auc: 0.610522291422
2019-06-16 18:22:15,130 - INFO - TRAIN --> pass: 0 batch: 300 loss: 0.586580200195 avg_loss :0.627978393555 avg_loss_100 :0.616376159668 auc: 0.580829420973, auc_100: 0.609009897329 ,batch_auc: 0.623971018562
2019-06-16 19:13:52,029 - INFO - TRAIN --> pass: 0 batch: 400 loss: 0.614757080078 avg_loss :0.625462036133 avg_loss_100 :0.617887817383 auc: 0.588551356393, auc_100: 0.61169839673 ,batch_auc: 0.62135427562
F0616 19:36:56.932013 26878 grpc_client.cc:357] GetRPC name:[embedding_0.w_0.block39], ep:[10.87.137.17:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:

Check failure stack trace:

@ 0x7f7b5c943cdd google::LogMessage::Fail()
@ 0x7f7b5c94778c google::LogMessage::SendToLog()
@ 0x7f7b5c943803 google::LogMessage::Flush()
@ 0x7f7b5c948c9e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f7b5d398994 paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f7b8972a8a0 execute_native_thread_routine
@ 0x7f7b930db1c3 start_thread
@ 0x7f7b9270312d __clone
@ (nil) (unknown)
.//paddle/start_trainer.sh: line 109: 14192 Aborted /home/disk1/normandy/maybach/app-user-20190616145609-6617/workspace/python27-gcc482//bin/python -u train.py

mrwjdhj3 1#

Did one of the machines go down?

F0616 19:36:56.932013 26878 grpc_client.cc:357] GetRPC name:[embedding_0.w_0.block39], ep:[10.87.137.17:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed
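
One quick way to follow up on this question is to test whether the endpoint named in the fatal log line still accepts TCP connections. gRPC status code 14 is UNAVAILABLE, which usually points at the transport or the remote process rather than at the request itself. The sketch below uses only the Python standard library; the helper name `endpoint_is_reachable` is hypothetical and is not part of Paddle. A port that is reachable now does not prove the parameter server was alive at the moment of the crash, so treat the result as a hint only.

```python
import socket


def endpoint_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; it only checks the transport,
    not whether the gRPC service behind the port is healthy.
    """
    try:
        # create_connection raises socket.error (refused, timed out,
        # unreachable, name resolution failure) when the connect fails.
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        return False
    conn.close()
    return True
```

For the log above, the call would be `endpoint_is_reachable("10.87.137.17", 62004)`, run from the trainer machine that aborted.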

iyzzxitl 2#

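Training runs normally with SGD,
but fails consistently with RMS.
So it does not look like a node problem; could it be a framework problem?
Job link:
http://10.182.165.13:8900/fileview.html?path=/home/disk1/normandy/maybach/app-user-20190616145824-20267/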

lo8azlld 3#

Could you set RMSProp's is_center to False and run it again?
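
To make the suggestion concrete, here is a minimal scalar sketch of what a "centered" flag typically changes in RMSProp. It follows the commonly documented formulation and is not taken from Paddle's source. The names `rmsprop_step` and `new_state` are hypothetical, and it is an assumption that `is_center` in the reply refers to the `centered` argument of Paddle's RMSProp optimizer. The thread does not confirm that this setting causes the gRPC failure.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"ms": 0.0, "mg": 0.0, "mom": 0.0}


def rmsprop_step(param, grad, state, lr=0.01, rho=0.95, eps=1e-6,
                 momentum=0.0, centered=False):
    """Apply one scalar RMSProp update and return the new parameter."""
    # Running average of squared gradients.
    state["ms"] = rho * state["ms"] + (1.0 - rho) * grad * grad
    denom = state["ms"]
    if centered:
        # Centered variant: also track the mean gradient and subtract its
        # square, so the denominator estimates the gradient variance.
        state["mg"] = rho * state["mg"] + (1.0 - rho) * grad
        denom -= state["mg"] ** 2
    # Momentum-smoothed step; eps keeps the square root away from zero.
    state["mom"] = momentum * state["mom"] + lr * grad / math.sqrt(denom + eps)
    return param - state["mom"]
```

In this sketch the centered denominator shrinks toward `eps` when gradients stay nearly constant, so individual steps can become much larger than in the uncentered variant. That is one plausible reason to try the run with centering disabled.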
