Paddle version: 1.3.0
Code path:
https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/icnet
Training fails with an error in multi-machine pserver mode.
If learning_rate is set to a fixed value, multi-machine training runs:
https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/icnet/train.py#L85
The error is as follows:
I0318 15:42:44.799264 16115 grpc_server.cc:430] Server listening on 127.0.0.1:9121 selected port: 9121
F0318 15:43:17.110155 18239 listen_and_serv_op.cc:74] run sub program:6 error Invoke operator momentum error.
Python Callstacks:
  File "/home/paddle/.jumbo/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1317, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/paddle/.jumbo/lib/python2.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 1836, in _append_pserver_ops
    attrs=opt_op.all_attrs())
  File "/home/paddle/.jumbo/lib/python2.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 773, in __append_optimize_op__
    self.origin_program, merged_var)
  File "/home/paddle/.jumbo/lib/python2.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 845, in get_pserver_program
    lr_ops)
  File "/home/paddle/ljh/baidu/paddle/test/cts_test/dist_base.py", line 72, in run_pserver
    pserver_prog = t.get_pserver_program(current_endpoint)
  File "/home/paddle/ljh/baidu/paddle/test/cts_test/dist_base.py", line 389, in runtime_main
    model.run_pserver(endpoints, trainers, current_endpoint, trainer_id, run_params)
  File "dist_icnet.py", line 155, in <module>
    runtime_main(TestDistIcnet)
C++ Callstacks:
Enforce failed. Expected framework::product(ctx->GetInputDim("LearningRate")) == 1, but received framework::product(ctx->GetInputDim("LearningRate")):0 != 1:1.
Learning_rate should be a scalar at [/paddle/paddle/fluid/operators/optimizers/momentum_op.h:67]
PaddlePaddle Call Stacks:
0 0x7f71d9e09e2dp void paddle::platform::EnforceNotMet::Init<std::string>(std::string, char const*, int) + 365
1 0x7f71d9e0a177p paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int) + 87
2 0x7f71daee263ep paddle::operators::MomentumOp::InferShape(paddle::framework::InferShapeContext*) const + 4350
3 0x7f71db81ab7bp paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 603
4 0x7f71db818425p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 341
5 0x7f71d9f2933ap paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool) + 218
6 0x7f71da9d9692p
7 0x7f71da9e094ap std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > >, std::__future_base::_Result_base::_Deleter>, std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > >::_M_invoke(std::_Any_data const&) + 42
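The enforce message in the log says the product of the LearningRate dimensions was 0 where 1 was expected. The following is a minimal plain-Python sketch of that kind of scalar check, written only to illustrate the arithmetic. It is not Paddle's actual code, and the helper names dims_product and check_learning_rate_dims are hypothetical.

```python
from functools import reduce
import operator


def dims_product(dims):
    """Multiply all dimensions together.

    Returning 1 for an empty list is this sketch's own choice and is not
    claimed to match what Paddle does.
    """
    return reduce(operator.mul, dims, 1)


def check_learning_rate_dims(dims):
    """Illustrative stand-in for the 'Learning_rate should be a scalar' check."""
    n = dims_product(dims)
    if n != 1:
        raise ValueError(
            "Learning_rate should be a scalar, got product %d for dims %r"
            % (n, list(dims)))
    return True


# A one-element tensor passes the check.
check_learning_rate_dims([1])

# Any shape containing a 0 dimension has product 0 and fails,
# which matches the "0 != 1" in the log above.
try:
    check_learning_rate_dims([0])
except ValueError as err:
    print(err)
```

Under this reading, the pserver-side momentum op appears to be seeing a LearningRate input whose shape contains a zero dimension. Why that happens is not established in this thread.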
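6 answers
ne5o7dgx1#
It looks like the value returned by this function is abnormal. I am not very familiar with this model; could you add a Print op to check whether the variable's contents are normal?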
https://github.com/PaddlePaddle/models/blob/bbdb2469ab676f562c6a3666879425474f57046a/fluid/PaddleCV/icnet/train.py#L57
wmtdaxz32#
Do you mean printing decayed_lr?
iklwldmw3#
The return value of the poly_decay() function.
uidvcgyl4#
PS: does it run correctly on a single machine?
x4shl7ld5#
"PS: does it run correctly on a single machine?"
Yes, it runs on a single machine.
okxuctiv6#
"The return value of the poly_decay() function."
The output is as follows:
decayed_lr is: Tensor[tmp_35]
shape: [1,]
dtype: f
data: 0.003,
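The printed value, 0.003 with shape [1], is what a polynomial decay schedule would give at step 0. Below is a minimal plain-Python sketch of such a schedule, assuming the common form base_lr * (1 - step / total_steps) ** power. Only base_lr = 0.003 comes from the output above; total_steps and power are made-up values for illustration and are not taken from train.py.

```python
def poly_decay(step, base_lr=0.003, total_steps=60000, power=0.9):
    """Polynomial learning-rate decay (illustrative sketch, not train.py).

    total_steps and power are hypothetical defaults chosen for this example.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must be in [0, total_steps]")
    return base_lr * (1.0 - float(step) / total_steps) ** power


print(poly_decay(0))      # 0.003, the base rate at the first step
print(poly_decay(60000))  # 0.0, fully decayed at the last step
```

Because the value changes with the step counter, a schedule like this has to be computed by ops in the program rather than stored as a constant, which is the difference from the fixed learning_rate case that runs in multi-machine mode.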