1.0.1.so

kpbpu008 posted on 2021-06-26 in Mesos

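I am trying to deploy TensorFlow (using the Mesos containerizer) in my DC/OS (v1.8.4) environment. The target node has GPU resources, and the NVIDIA driver is already installed.
TensorFlow fails to start, and the Mesos stderr contains only one error message: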

mesos-containerizer: error while loading shared libraries: libmesos-1.0.1.so: cannot open shared object file: No such file or directory

The Mesos log contains the following messages:

Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: I0623 10:25:43.069501  4222 linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS | CLONE_NEWPID
Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: I0623 10:25:43.072198  4222 systemd.cpp:96] Assigned child process '4977' to 'mesos_executors.slice'
Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: I0623 10:25:43.074686  4222 containerizer.cpp:1319] Checkpointing executor's forked pid 4977 to '/var/lib/mesos/slave/meta/slaves/e97f452e-17de-4c0e-b07d-b9955bbc0844-S3/frameworks/00da5f21-f3ab-4237-b85e-8b767ef53d43-0000/executors/gpu2-cuda.49c58356-57fe-11e7-afed-56b5da32b775/runs/f5ad3b1b-5beb-45d6-b4d4-bd592ae09be8/pids/forked.pid'
Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: I0623 10:25:43.077404  4222 containerizer.cpp:1863] Executor for container 'f5ad3b1b-5beb-45d6-b4d4-bd592ae09be8' has exited
Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: I0623 10:25:43.077416  4222 containerizer.cpp:1622] Destroying container 'f5ad3b1b-5beb-45d6-b4d4-bd592ae09be8'
Jun 23 10:25:43 gpu-test linker-start-agent.sh[4198]: E0623 10:25:43.178581  4219 slave.cpp:3976] Container 'f5ad3b1b-5beb-45d6-b4d4-bd592ae09be8' for executor'gpu2-cuda.49c58356-57fe-11e7-afed-56b5da32b775' of framework 00da5f21-f3ab-4237-b85e-8b767ef53d43-0000 failed to start: Collect failed: Failed to setup hostname and network files: Failed to enter the mount namespace of pid 4977: Pid 4977 does not exist

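The log shows that the failure occurs right after the executor process is forked.
However, if I deploy an nginx service with the Mesos containerizer, it runs fine. mesos-containerizer fails to load libmesos-1.0.1.so only when I deploy a TensorFlow or CUDA service. All of these services use the Mesos containerizer.
I don't know what is going wrong.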
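
One way to narrow this down is to check which shared libraries the dynamic linker cannot resolve for the `mesos-containerizer` binary. The following is a minimal diagnostic sketch, not a fix: it runs `ldd` on a binary and lists the entries reported as `not found`. The helper names `find_missing_libs` and `check_binary` are hypothetical, and the path to `mesos-containerizer` has to be supplied by hand because it depends on the installation. Note also that `ldd` resolves libraries using the environment of the shell it runs in, which may differ from the environment the agent gives to a GPU task (for example `LD_LIBRARY_PATH`), so the result is only a hint.

```python
import subprocess
import sys


def find_missing_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved and unresolved entries look like "libfoo.so => <target>".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip().startswith("not found"):
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on the given binary and return its unresolved libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return find_missing_libs(result.stdout)


# Usage: python check_libs.py /path/to/mesos-containerizer
# (the script name and the path are placeholders, not known locations)
if __name__ == "__main__" and len(sys.argv) == 2:
    for lib in check_binary(sys.argv[1]):
        print("unresolved:", lib)
```

If `libmesos-1.0.1.so` resolves in a normal shell but the task still fails, that would suggest the library search path seen by the GPU task differs from the one seen by the nginx task. Comparing the environment of the two task definitions would be a reasonable next step.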
