Paddle CPU MPI cluster, worker: grpc error, error_code:4 error_message:Deadline Exceeded, server: Tensor holds no memory.

h6my8fg2 · posted on 2021-12-07 · Java

fluid 1.6.0
The embedding layer is created with sparse set to True. This embedding layer feeds two networks, and at the end the outputs of those two networks have to be multiplied elementwise; that is where the error is raised. Both outputs have shape [batch_size, 1].

PaddleCheckError: Expected ctx->GetInputDim("Y")[0] == 1, but received ctx->GetInputDim("Y")[0]:0 != 1:1.
ShapeError: For elementwise_op, if X is Sparse(VarType.SELECTED_ROWS), Y must be scalar. But reveived the first dimension of Y = 0 at [/paddle/paddle/fluid/operators/elementwise/elementwise_op.h:73]
  [operator < elementwise_mul > error]

***Check failure stack trace:***

F1119 16:00:27.683459 17296 listen_and_serv_op.cc:77] run sub program:1 error

Does setting sparse to True change the type of this tensor to VarType.SELECTED_ROWS?
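For reference, a minimal sketch of the structure described above (layer names follow the ones visible in the logs later in this thread; vocabulary size and embedding width are made-up placeholders, and in fluid's own API the flag is spelled is_sparse):

import paddle.fluid as fluid

ids = fluid.layers.data(name='sparse_ids', shape=[1], dtype='int64')

# is_sparse=True makes the *gradient* of the embedding table a
# SELECTED_ROWS variable; the forward output of the layer itself
# is still a dense LoDTensor.
emb = fluid.layers.embedding(
    input=ids,
    size=[100000, 64],
    is_sparse=True,
    param_attr=fluid.ParamAttr(name='embedding.w'))

# Two heads on top of the shared embedding, each with output shape
# [batch_size, 1].
ctr_out = fluid.layers.fc(input=emb, size=1, act='sigmoid', name='ctr_out')
cvr_out = fluid.layers.fc(input=emb, size=1, act='sigmoid', name='cvr_out')

# The elementwise multiply that raises the ShapeError above.
ctcvr_out = fluid.layers.elementwise_mul(ctr_out, cvr_out)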

plupiseo

plupiseo1#

It looks like it died at this spot:
Paddle/listen_and_serv_op.cc at develop · PaddlePaddle/Paddle
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/distributed_ops/listen_and_serv_op.cc#L77

vsmadaxz

vsmadaxz2#

@ysh329
CPU distributed training on an MPI cluster, paddlecloud job: job-0bb5dd39ddb5be22. Log of the first node:
http://10.182.98.139:8900/fileview.html?path=/home/disk1/normandy/maybach/app-user-20191119154646-25171/

Detailed log:
resource_list: nodes=20,walltime=48:00:00,resource=full

qsubf arguments confirmation:
-N xuzhang_scvr_yangzhifeng01_20191119_paddlecloud
-v PLATFORM=maybach
--conf /home/paddle/cloud/job/job-0bb5dd39ddb5be22/submit/qsub_f.conf
--hdfs afs://tianqi.afs.baidu.com:9902
--ugi fcr-tianqi-d,absUPEwUB7nc
--hout /app/ecom/brand/yangzhifeng01/scvr/output/08b5e88a-7b60-5ee6-b215-bcd484367c19/job-0bb5dd39ddb5be22/
--files [./paddle]

[INFO] client.version: 3.5.0
[INFO] session.id: 19772208.yq01-smart-000.yq01.baidu.com
[INFO] making tar.gz: from [./paddle]
[INFO] making tar.gz done: size=837925
[INFO] uploading the job package finished.
[INFO] qsub_f: jobid=app-user-20191119154646-25171.yq01-hpc-lvliang01-smart-master.dmop.baidu.com, pls waiting for complete!
[INFO] qsub_f: see more at http://yq01-hpc-lvliang01-mon.dmop.baidu.com:8919/taskinfo.html?appId=app-user-20191119154646-25171
[INFO] qsub_f: to stop, pls run: qdel app-user-20191119154646-25171.yq01-hpc-lvliang01-smart-master.dmop.baidu.com

hc2pp10m

hc2pp10m3#

Later I simulated a distributed environment locally with paddle.distributed.launch_ps --worker_num 2 --server_num 2, using the latest Python package from https://www.paddlepaddle.org.cn/install/doc/tables . The error messages are below.

Worker side:

F1120 21:26:39.707716  9337 grpc_client.cc:504] GetRPC name:[ctcvr_out.b_0], ep:[127.0.0.1:6170], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:

***Check failure stack trace:***

    @     0x7f087509a7fd  google::LogMessage::Fail()
    @     0x7f087509e2ac  google::LogMessage::SendToLog()
    @     0x7f087509a323  google::LogMessage::Flush()
    @     0x7f087509f7be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f08760e924a  paddle::operators::distributed::GRPCClient::Proceed()
    @     0x7f0893e718a0  execute_native_thread_routine
    @     0x7f08bcd091c3  start_thread
    @     0x7f08bc33112d  __clone
    @              (nil)  (unknown)

Server side:

----------------------
Error Message Summary:
----------------------
Error: Tensor holds no memory. Call Tensor::mutable_data first.
  [Hint: holder_ should not be null.] at (/paddle/paddle/fluid/framework/tensor.cc:23)
  [operator < elementwise_mul > error]

***Check failure stack trace:***

    @     0x7f3333029973  __GI___pthread_once
    @     0x7f32ebe37abd  _ZNSt17_Function_handlerIFSt10unique_ptrIN6paddle8platform13EnforceNotMetESt14default_deleteIS3_EEvESt17reference_wrapperISt12_Bind_simpleIFS8_IZNS1_9framework10ThreadPool18RunAndGetExceptionIZNS1_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNSA_8ExecutorERKSE_ISt10shared_ptrINSA_22ExecutorPrepareContextEESaISN_EEPNSA_11ProgramDescEPNSA_5ScopeEEUlvE_EESt6futureIS6_ET_EUlvE_EvEEEE9_M_invokeERKSt9_Any_data
    @     0x7f32ebe361a2  _ZNSt13__future_base11_Task_stateIZN6paddle9framework10ThreadPool18RunAndGetExceptionIZNS1_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNS2_8ExecutorERKS6_ISt10shared_ptrINS2_22ExecutorPrepareContextEESaISF_EEPNS2_11ProgramDescEPNS2_5ScopeEEUlvE_EESt6futureISt10unique_ptrINS1_8platform13EnforceNotMetESt14default_deleteISS_EEET_EUlvE_SaIiEFSV_vEE6_M_runEv
    @     0x7f32eb37a7be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f32ebe277ea  std::_Function_handler<>::_M_invoke()
    @     0x7f32eb3757fd  google::LogMessage::Fail()
    @     0x7f32ec8ff069  paddle::framework::ThreadPool::TaskLoop()
    @     0x7f32ebe37abd  _ZNSt17_Function_handlerIFSt10unique_ptrIN6paddle8platform13EnforceNotMetESt14default_deleteIS3_EEvESt17reference_wrapperISt12_Bind_simpleIFS8_IZNS1_9framework10ThreadPool18RunAndGetExceptionIZNS1_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNSA_8ExecutorERKSE_ISt10shared_ptrINSA_22ExecutorPrepareContextEESaISN_EEPNSA_11ProgramDescEPNSA_5ScopeEEUlvE_EESt6futureIS6_ET_EUlvE_EvEEEE9_M_invokeERKSt9_Any_data
    @     0x7f330a18c8a0  execute_native_thread_routine
    @     0x7f32ebe277ea  std::_Function_handler<>::_M_invoke()
    @     0x7f32eb3792ac  google::LogMessage::SendToLog()
    @     0x7f33330241c3  start_thread
    @     0x7f32eb2de1a7  std::__future_base::_State_base::_M_do_set()
    @     0x7f333264c12d  __clone
    @     0x7f32eb375323  google::LogMessage::Flush()
    @     0x7f3333029973  __GI___pthread_once
    @     0x7f32ec8ff069  paddle::framework::ThreadPool::TaskLoop()
    @     0x7f32ebe361a2  _ZNSt13__future_base11_Task_stateIZN6paddle9framework10ThreadPool18RunAndGetExceptionIZNS1_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNS2_8ExecutorERKS6_ISt10shared_ptrINS2_22ExecutorPrepareContextEESaISF_EEPNS2_11ProgramDescEPNS2_5ScopeEEUlvE_EESt6futureISt10unique_ptrINS1_8platform13EnforceNotMetESt14default_deleteISS_EEET_EUlvE_SaIiEFSV_vEE6_M_runEv
    @     0x7f330a18c8a0  execute_native_thread_routine
    @              (nil)  (unknown)

Looking through other issues, this grpc error, error_code:4 error_message:Deadline Exceeded problem comes up a lot, and there is still no good solution for it.

bbuxkriu

bbuxkriu4#

If I do not use fleet distributed training, the code runs fine locally, so the inputs and outputs of the network itself should be fine.

from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.incubate.fleet.base import role_maker

# The role (server/worker) is taken from the environment via
# PaddleCloudRoleMaker().
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)

if fleet.is_server():
    # Warm-start from an existing model directory when one is available.
    init_model = './init_model'
    if self.exists_and_contains_file(init_model):
        fleet.init_server(init_model)
    else:
        fleet.init_server()
    fleet.run_server()
elif fleet.is_worker():
    exe = fluid.Executor(place)
    fleet.init_worker()
    exe.run(fluid.default_startup_program())
    self.train_loop(
        main_program=main_program,
        test_program=test_program,
        exe=exe,
        net=net,
        place=place)
    fleet.stop_worker()

aamkag61

aamkag615#

exe.run(fluid.default_startup_program()) -> exe.run(fleet.startup_program)
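In context, that change on the worker branch would look like the sketch below (it assumes the usual fleet.minimize(optimizer, loss) call happened when the net was built, which answer #4's excerpt does not show):

elif fleet.is_worker():
    exe = fluid.Executor(place)
    fleet.init_worker()
    # was: exe.run(fluid.default_startup_program())
    # After fleet.minimize() the transpiled programs live on the fleet
    # object, so the worker should run fleet.startup_program instead of
    # the untranspiled default startup program.
    exe.run(fleet.startup_program)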

j8yoct9x

j8yoct9x6#

@guru4elephant
Changed it to exe.run(fleet.startup_program). The error is unchanged but the logs changed; this is still the locally simulated distributed setup. I turned on GLOG_vmodule=operator=4 and GLOG_logtostderr=1.
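One way to flip those glog switches for such a run (a sketch; setting them in the environment before paddle is imported should work, since glog reads them when the library loads):

import os

# glog picks these flags up when the paddle shared library is loaded,
# so they must be in the environment before the import below.
os.environ['GLOG_vmodule'] = 'operator=4'
os.environ['GLOG_logtostderr'] = '1'

import paddle.fluid as fluid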
When I use exe.run(fluid.default_startup_program()), the initialization of many variables on the server side already seems wrong; grepping the log for elementwise_mul shows a bad variable shape:
I1120 21:23:39.505406 9450 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[fc_1.w_0@GRAD:float[128, 64]({})], Y[elementwise_div_0:[0]({})]}, outputs:{Out[elementwise_mul_9:[0]({})]}.
But after switching to exe.run(fleet.startup_program), some of the server-side variables look correct:

I1121 17:00:42.707172 19450 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[fc_1.w_0@GRAD:float[128, 64]({})], Y[elementwise_div_0:float[1]({})]}, outputs:{Out[elementwise_mul_9:[0]({})]}.
I1121 17:00:42.707298 19450 operator.cc:172] CPUPlace Op(elementwise_mul), inputs:{X[fc_1.w_0@GRAD:float[128, 64]({})], Y[elementwise_div_0:float[1]({})]}, outputs:{Out[elementwise_mul_9:float[128, 64]({})]}.

However, some variables shared by both runs are still wrong. For example,
I1121 17:00:42.701272 19460 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[ctr_out.w_0@GRAD:float[64, 1]({})], Y[elementwise_div_0:[0]({})]}, outputs:{Out[elementwise_mul_4:[0]({})]}.
I noticed a pattern: when a variable is logged twice, first from operator.cc:152 and then from operator.cc:172, its shape is normal; when only the operator.cc:152 line appears, the shape is wrong. Presumably the :152 line logs the inputs before the op runs and the :172 line logs after it finishes, so a lone :152 line means the op never completed.
While the job runs, it looks like the server dies first; the worker then waits on the server for a while and, because the server is gone, the RPC deadline expires and the worker dies too.

At the same time, some variables shared by both runs are correct. For example,

I1121 17:21:01.223989 11659 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[fc_0.b_0@GRAD:float[64]({})], Y[elementwise_div_0:[0]({})]}, outputs:{Out[elementwise_mul_7:[0]({})]}.
I1121 17:21:01.225289 11659 operator.cc:172] CPUPlace Op(elementwise_mul), inputs:{X[fc_0.b_0@GRAD:float[64]({})], Y[elementwise_div_0:float[1]({})]}, outputs:{Out[elementwise_mul_7:float[64]({})]}.
fykwrbwg

fykwrbwg7#

My model has two inputs: a sparse input that goes through the embedding layer, and a dense input that goes through an fc layer. With both exe.run(fluid.default_startup_program()) and exe.run(fleet.startup_program) above, the server-side log shows wrong dimensions for the gradients of both the embedding layer and the fc layer, as follows:

I1121 17:00:42.701313 19459 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[embedding.w@GRAD[row_size=7]:float[7, 64]({{}})], Y[elementwise_div_0:[0]({})]}, outputs:{Out[elementwise_mul_6[row_size=0]:uninited[0]({{}})]}.
I1121 17:00:42.700583 19451 operator.cc:152] CPUPlace Op(elementwise_mul), inputs:{X[dense_input_fc.w@GRAD:float[2, 64]({})], Y[elementwise_div_0:[0]({})]}, outputs:{Out[elementwise_mul_5:[0]({})]}.
