Unable to re-register Mesos agent

wbgh16ku, posted on 2021-06-26 in Mesos

After updating a property (isolation), mesos-slave failed to re-register:

6868 status_update_manager.cpp:177] Pausing sending status updates
6877 slave.cpp:915] New master detected at master@192.168.1.1:5050
6867 status_update_manager.cpp:177] Pausing sending status updates
6877 slave.cpp:936] No credentials provided. Attempting to register without authentication
6877 slave.cpp:947] Detecting new master
6869 slave.cpp:1217] Re-registered with master master@192.168.1.1:5050
6866 status_update_manager.cpp:184] Resuming sending status updates
6869 slave.cpp:1253] Forwarding total oversubscribed resources {}
6874 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
6874 slave.cpp:904] Re-detecting master
6874 slave.cpp:947] Detecting new master
6874 status_update_manager.cpp:177] Pausing sending status updates
6869 status_update_manager.cpp:177] Pausing sending status updates
6871 slave.cpp:915] New master detected at master@192.168.1.1:5050
6871 slave.cpp:936] No credentials provided. Attempting to register without authentication
6871 slave.cpp:947] Detecting new master
6872 slave.cpp:1217] Re-registered with master master@192.168.1.1:5050
6872 slave.cpp:1253] Forwarding total oversubscribed resources {}
6871 status_update_manager.cpp:184] Resuming sending status updates
6871 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.

It seems to be stuck in an infinite loop. Any idea how to start a fresh agent? I tried removing work_dir and restarting mesos-slave, but without success.
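
One way to confirm the loop, rather than judge it by eye, is to count how often the agent logs the forced re-registration message shown above. This is only a minimal sketch: the function name and the log path in the usage comment are assumptions, not part of the original setup.

```shell
# Count how many times the agent was forced to re-register.
# Usage: count_forced_reregistrations /var/log/mesos/mesos-slave.INFO
# (the path is a hypothetical example; use wherever your agent logs go)
count_forced_reregistrations() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c 'Forcing re-registration' "$1" || true
}
```

A count that keeps growing between two runs a few seconds apart suggests the agent is still cycling through re-registration.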
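This situation was caused by an accidental rename of work_dir. After the restart, mesos-slave was unable to reconnect or to kill the running tasks. I tried using cleanup on the agent: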

echo 'cleanup' > /etc/mesos-slave/recover
service mesos-slave restart

# after recovery finishes

rm /etc/mesos-slave/recover
service mesos-slave restart

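This helped to some extent, but there are still many zombie tasks in Marathon, because the Mesos master cannot retrieve any information about those tasks. When I looked at the metrics, I found that some agents were marked as "inactive".
Update: the following appears in the master log: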

Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) at
scheduler-e76665b1-de85-48a3-b9fd-5e736b64a9d8@192.168.1.10:52192
because the agent cac09818-0d75-46a9-acb1-4e17fdb9e328-S10 at 
slave(1)@192.168.1.1:5051 (w10.example.net) is disconnected. 
Kill will be retried if the agent re-registers

After restarting the current mesos-master:

Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon)
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33@192.168.1.10:39972 
because it is unknown; performing reconciliation

Performing explicit task state reconciliation for 1 tasks 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) 
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33@192.168.1.10:39972

Dropping reconciliation of task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
for framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) 
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33@192.168.1.10:39972 
because there are transitional agents
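
To see which tasks are affected, the "Cannot kill task" entries above can be reduced to a list of unique task IDs. The following is a rough sketch, assuming each log entry sits on a single line in the real master log (the excerpts above are wrapped for readability); the function name is hypothetical.

```shell
# Print the unique task IDs from "Cannot kill task ..." lines in a master log.
# Usage: unkillable_tasks /path/to/mesos-master.log   (path is an assumption)
unkillable_tasks() {
    awk '/Cannot kill task/ {
        # The task ID is the field that follows the word "task".
        for (i = 1; i < NF; i++)
            if ($i == "task") { print $(i + 1); break }
    }' "$1" | sort -u
}
```

The resulting IDs can then be compared with what Marathon still reports as running.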
Answer #1, by mtb9vblg

The split-brain situation was caused by having more than one work_dir. In most cases it is enough to move the data out of the incorrect work_dir:

mv /tmp/mesos/slaves/* /var/lib/mesos/slaves/

Then force re-registration:

rm -rf /var/lib/mesos/meta/slaves/latest
service mesos-slave restart
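
The two steps above can also be wrapped in a small function with a basic sanity check before anything is moved. This is a sketch only: the function name is made up, and the two directory arguments stand in for the wrong and the intended work_dir (for example /tmp/mesos and /var/lib/mesos from the commands above). It does not handle the case where an agent directory of the same name already exists in the target.

```shell
# Sketch of the recovery steps as one function.
# $1: the incorrect work_dir, $2: the intended work_dir.
migrate_work_dir() {
    old_dir=$1
    new_dir=$2

    # Refuse to continue if there is nothing to move.
    if [ ! -d "$old_dir/slaves" ]; then
        echo "no slaves directory under $old_dir" >&2
        return 1
    fi

    mkdir -p "$new_dir/slaves"
    # Move every agent directory from the wrong work_dir into the right one.
    mv "$old_dir"/slaves/* "$new_dir/slaves/"

    # Remove the "latest" entry to force re-registration on the next restart.
    rm -rf "$new_dir/meta/slaves/latest"
}

# Example (run as a user allowed to write both directories, then restart):
#   migrate_work_dir /tmp/mesos /var/lib/mesos && service mesos-slave restart
```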

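Currently running tasks will not survive (they cannot be recovered). Tasks from the old executors should be marked as TASK_LOST and scheduled for cleanup. This avoids the problem of zombie tasks that Mesos cannot kill (because they are running under a different work_dir).
If mesos-slave is still registered as inactive, restart the current Mesos master.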
