Model initialization problem in Paddle distributed training

4xy9mtcn posted on 2021-11-29 in Java

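I submit training jobs through Paddle Cloud. Without an init model the job trains normally, but whenever I initialize the model with the setting below, it fails: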
init_model_path=/app/ecom/native-ad/zhaoyang29/reward_video_nnq/xcxxs/fluidmodel/output/ab28ea5e-cb8b-50b6-9503-b447b99aabed/job-0bb5d31881d04ac9/output/rank-00000/pass-9/
Error: F0731 03:18:58.043956 37734 grpc_client.cc:418] GetRPC name:[sequence_conv_4.b_0], ep:[10.182.76.151:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:

Check failure stack trace:

@ 0x7f5187709c0d google::LogMessage::Fail()
@ 0x7f518770d6bc google::LogMessage::SendToLog()
@ 0x7f5187709733 google::LogMessage::Flush()
@ 0x7f518770ebce google::LogMessageFatal::~LogMessageFatal()
@ 0x7f518830ee0e paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f5191e008a0 execute_native_thread_routine
@ 0x7f522c78c1c3 start_thread
@ 0x7f522bdb412d __clone
@ (nil) (unknown)
Is something wrong with my initialization, or is there another cause?

but5z9lq #1

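Root cause: when you load a pretrained model and the new model's structure has changed (most often because the output dimension of the last layer was modified), you need to rename the parameters of the changed layers, so that saved values with the old shape are not loaded into them.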
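One way to act on this advice is to compare parameter shapes before loading. The sketch below is plain Python and does not call any Paddle API; the parameter names and shapes are hypothetical, and in a real job they would have to be read from the saved model and from the program being trained.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def split_loadable(
    saved: Dict[str, Shape], current: Dict[str, Shape]
) -> Tuple[List[str], List[str], List[str]]:
    """Partition the current model's parameters by how the checkpoint relates to them.

    Returns (loadable, mismatched, missing):
      loadable   - same name and same shape in the checkpoint
      mismatched - same name but a different shape (rename or skip these)
      missing    - not present in the checkpoint (left at fresh initialization)
    """
    loadable, mismatched, missing = [], [], []
    for name, shape in current.items():
        if name not in saved:
            missing.append(name)
        elif saved[name] != shape:
            mismatched.append(name)
        else:
            loadable.append(name)
    return loadable, mismatched, missing


# Hypothetical shapes: the last layer's output dimension changed from 10 to 2.
saved_shapes = {
    "fc_0.w_0": (64, 32), "fc_0.b_0": (32,),
    "output.w_0": (32, 10), "output.b_0": (10,),
}
current_shapes = {
    "fc_0.w_0": (64, 32), "fc_0.b_0": (32,),
    "output.w_0": (32, 2), "output.b_0": (2,),
}

loadable, mismatched, missing = split_loadable(saved_shapes, current_shapes)
print("load from checkpoint:", loadable)
print("rename or skip:", mismatched)
print("not in checkpoint:", missing)
```

Once a mismatched layer is renamed in the new model, it moves from the second list to the third: the checkpoint no longer has a variable with that name, so the layer keeps its fresh initialization.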

qij5mzcb #2

Wed Jul 31 03:18:58 2019[1,27]:Enforce failed. Expected param_dims == ctx->GetInputDim("Moment1"), but received param_dims:424, 64 != ctx->GetInputDim("Moment1"):21199, 64.
Wed Jul 31 03:18:58 2019[1,27]:Param and Moment1 input of AdamOp should have same dimension at [/paddle/paddle/fluid/operators/optimizers/adam_op.cc:67]
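Judging from the log above, the cause may be that the init model and the training job use different optimizers. I resubmitted the job with the same model and optimizer for both, but it still fails: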
Thu Aug 1 05:10:38 2019[1,13]:F0801 05:10:38.174762 17758 grpc_client.cc:418] SendRPC name:[fc5.w_0@GRAD.trainer_13], ep:[10.182.18.22:62005], status:[-1] meets grpc error, error_code:14 error_message:OS Error error_details:
Thu Aug 1 05:10:38 2019[1,13]:***Check failure stack trace:***
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f41f9f5cc0d google::LogMessage::Fail()
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f41f9f606bc google::LogMessage::SendToLog()
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f41f9f5c733 google::LogMessage::Flush()
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f41f9f61bce google::LogMessageFatal::~LogMessageFatal()
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f41fab61e0e paddle::operators::distributed::GRPCClient::Proceed()
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f42046538a0 execute_native_thread_routine
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f429efdf1c3 start_thread
Thu Aug 1 05:10:38 2019[1,13]: @ 0x7f429e60712d __clone
Thu Aug 1 05:10:38 2019[1,13]: @ (nil) (unknown)

Thu Aug 1 05:05:19 2019[1,17]:+ ret=0
Thu Aug 1 05:05:19 2019[1,17]:+ 0 -ne 0
Thu Aug 1 05:05:19 2019[1,17]:+ log_info 'download from [/app/ecom/native-ad/zhaoyang29/reward_video_nnq/xcxxs/fluidmodel/output/ab28ea5e-cb8b-50b6-9503-b447b99aabed/job-0bb5d3c75135c552/output/rank-00000/pass-8] to /home/disk1/normandy/maybach/app-user-20190731235312-6982/workspace/env_run/init_model success'
Thu Aug 1 05:05:19 2019[1,17]:+ echo '[./paddle/hadoop_functions.sh : 119] [hadoop_get_file]'
Thu Aug 1 05:05:19 2019[1,17]:[./paddle/hadoop_functions.sh : 119] [hadoop_get_file]
Thu Aug 1 05:05:19 2019[1,17]:+ echo '[INFO]: download from [/app/ecom/native-ad/zhaoyang29/reward_video_nnq/xcxxs/fluidmodel/output/ab28ea5e-cb8b-50b6-9503-b447b99aabed/job-0bb5d3c75135c552/output/rank-00000/pass-8] to /home/disk1/normandy/maybach/app-user-20190731235312-6982/workspace/env_run/init_model success'
Thu Aug 1 05:05:19 2019[1,17]:[INFO]: download from [/app/ecom/native-ad/zhaoyang29/reward_video_nnq/xcxxs/fluidmodel/output/ab28ea5e-cb8b-50b6-9503-b447b99aabed/job-0bb5d3c75135c552/output/rank-00000/pass-8] to /home/disk1/normandy/maybach/app-user-20190731235312-6982/workspace/env_run/init_model success
Thu Aug 1 05:05:19 2019[1,17]:+ break
Thu Aug 1 05:05:19 2019[1,17]:+ return 0
Job link: http://yq01-hpc-lvliang01-mon.dmop.baidu.com:8919/taskinfo.html?appId=app-user-20190731235312-6982
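The AdamOp enforce failure quoted in this post reports two shapes, and it can help to see exactly which axis differs. The following is a minimal sketch that parses that one message format with a regular expression; the pattern is written against the line as quoted here and is an assumption about the format, not a documented Paddle log schema.

```python
import re
from typing import List, Optional, Tuple

# Matches the "but received param_dims:... != ctx->GetInputDim("X"):..." part.
ENFORCE_RE = re.compile(
    r'param_dims:(?P<param>[\d, ]+?) != '
    r'ctx->GetInputDim\("(?P<slot>\w+)"\):(?P<other>[\d, ]+)'
)


def parse_dims(text: str) -> Tuple[int, ...]:
    """Turn '424, 64' into (424, 64)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_enforce(
    line: str,
) -> Optional[Tuple[str, Tuple[int, ...], Tuple[int, ...], List[int]]]:
    """Return (slot, param_shape, slot_shape, differing_axes), or None if no match."""
    m = ENFORCE_RE.search(line)
    if m is None:
        return None
    param = parse_dims(m.group("param"))
    other = parse_dims(m.group("other"))
    # Axes are only compared position by position; a rank difference is not reported.
    differing = [i for i, (a, b) in enumerate(zip(param, other)) if a != b]
    return m.group("slot"), param, other, differing


line = (
    'Enforce failed. Expected param_dims == ctx->GetInputDim("Moment1"), '
    'but received param_dims:424, 64 != ctx->GetInputDim("Moment1"):21199, 64.'
)
result = parse_enforce(line)
if result is not None:
    slot, param, other, differing = result
    print(f"Param {param} vs {slot} {other}; differing axes: {differing}")
```

For the quoted line, only axis 0 differs (424 versus 21199). That would be consistent with the parameter itself having a different size in the two models, though the log alone does not say which side came from the init model.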

In another job, which did not use an init model, training hit a problem in the fourth pass:
F0731 21:39:21.411710 7248 grpc_client.cc:408] SendRPC name:[output.w_0@GRAD.trainer_0], ep:[10.182.75.160:62006], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:

Check failure stack trace:

@ 0x7fb426c09c0d google::LogMessage::Fail()
@ 0x7fb426c0d6bc google::LogMessage::SendToLog()
@ 0x7fb426c09733 google::LogMessage::Flush()
@ 0x7fb426c0ebce google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb42780ee0e paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7fb432d188a0 execute_native_thread_routine
@ 0x7fb4cd6a41c3 start_thread
@ 0x7fb4ccccc12d __clone
@ (nil) (unknown)
Job link: http://yq01-hpc-lvliang01-mon.dmop.baidu.com:8919/taskinfo.html?appId=app-user-20190728134105-1287
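The three fatal lines from grpc_client.cc in this thread share one format, and the numeric error_code is a standard gRPC status code: 14 is UNAVAILABLE and 4 is DEADLINE_EXCEEDED. The sketch below pulls out the RPC type, variable name, endpoint and code so that the failing endpoints can be compared across jobs. The regular expression is an assumption based only on the lines quoted here.

```python
import re
from typing import NamedTuple, Optional

# Standard gRPC status codes seen in this thread.
GRPC_STATUS_NAMES = {4: "DEADLINE_EXCEEDED", 14: "UNAVAILABLE"}


class RpcFailure(NamedTuple):
    rpc: str
    var: str
    endpoint: str
    code: int
    status_name: str
    message: str


RPC_ERROR_RE = re.compile(
    r"(?P<rpc>\w+RPC) name:\[(?P<var>[^\]]+)\], ep:\[(?P<ep>[^\]]+)\], "
    r"status:\[-?\d+\] meets grpc error, "
    r"error_code:(?P<code>\d+) error_message:(?P<msg>.*?) error_details:"
)


def parse_rpc_failure(line: str) -> Optional[RpcFailure]:
    """Extract the fields of one grpc_client.cc fatal line, or None if it does not match."""
    m = RPC_ERROR_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return RpcFailure(
        m.group("rpc"), m.group("var"), m.group("ep"), code,
        GRPC_STATUS_NAMES.get(code, "OTHER"), m.group("msg"),
    )


# The three fatal lines quoted in this thread.
LOG_LINES = [
    "F0731 03:18:58.043956 37734 grpc_client.cc:418] GetRPC name:[sequence_conv_4.b_0], ep:[10.182.76.151:62004], status:[-1] meets grpc error, error_code:14 error_message:Socket closed error_details:",
    "Thu Aug 1 05:10:38 2019[1,13]:F0801 05:10:38.174762 17758 grpc_client.cc:418] SendRPC name:[fc5.w_0@GRAD.trainer_13], ep:[10.182.18.22:62005], status:[-1] meets grpc error, error_code:14 error_message:OS Error error_details:",
    "F0731 21:39:21.411710 7248 grpc_client.cc:408] SendRPC name:[output.w_0@GRAD.trainer_0], ep:[10.182.75.160:62006], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:",
]

failures = [f for f in map(parse_rpc_failure, LOG_LINES) if f is not None]
for f in failures:
    print(f"{f.rpc} {f.var} -> {f.endpoint}: {f.status_name} ({f.message})")
```

An UNAVAILABLE status with "Socket closed" generally means the connection to the peer went away, so the log of the process behind the reported endpoint is a reasonable place to look next. DEADLINE_EXCEEDED means the call did not complete within its timeout.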
