Can only connect to Mesos from Spark on the same machine

Posted by inn6fuwd on 2021-06-21 in Mesos

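I am trying to run Spark on a Mesos cluster.
When I run ./bin/spark-shell --master mesos://host:5050 from the machine where the Mesos master is running, everything works fine. However, if I run the same command from another machine, the process hangs after trying to connect: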

I0825 07:30:10.184141 27380 sched.cpp:126] Version: 0.19.0
I0825 07:30:10.187476 27385 sched.cpp:222] New master detected at master@192.168.0.241:5050
I0825 07:30:10.187619 27385 sched.cpp:230] No credentials provided. Attempting to register without authentication

On the Mesos master I see the following output:

[...]
I0825 15:30:23.928402 23214 master.cpp:684] Giving framework 20140825-143817-4043352256-5050-23194-0002 0ns to failover
I0825 15:30:23.929033 23210 master.cpp:2849] Framework failover timeout, removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929095 23210 master.cpp:3344] Removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929687 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):6831; disk(*):455983; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-31 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.935073 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):15001; disk(*):917264; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-29 from framework   20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938248 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):6823; disk(*):455991; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-32 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938356 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):4939; disk(*):457873; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-28 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938397 23210 hierarchical_allocator_process.hpp:362] Removed framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:27.952940 23215 http.cpp:452] HTTP request for '/master/state.json'
W0825 15:30:29.595441 23208 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-32 on slave 20140822-144404-4043352256-5050-15999-32 at slave(1)@192.168.0.233:5051 (cluster2)
W0825 15:30:29.596709 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-29 on slave 20140822-144404-4043352256-5050-15999-29 at slave(1)@192.168.0.241:5051 (cluster4)
W0825 15:30:29.615630 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-31 on slave 20140822-144404-4043352256-5050-15999-31 at slave(1)@192.168.0.213:5051 (cluster3)
W0825 15:30:29.935130 23214 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-28 on slave 20140822-144404-4043352256-5050-15999-28 at slave(1)@192.168.0.212:5051 (cluster1)

The slaves, meanwhile, output:

[...]
I0825 15:30:08.450343   980 slave.cpp:1337] Asked to shut down framework 20140825-143817-4043352256-5050-23194-0002 by master@192.168.0.241:5050
I0825 15:30:08.455153   980 slave.cpp:1362] Shutting down framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:08.455401   980 slave.cpp:2698] Shutting down executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456045   982 slave.cpp:2768] Killing executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456217   982 mesos_containerizer.cpp:992] Destroying container '37cc2b09-0e6d-4738-a837-7956367bba2b'
I0825 15:30:14.134845   977 mesos_containerizer.cpp:1108] Executor for container '37cc2b09-0e6d-4738-a837-7956367bba2b' has exited
I0825 15:30:14.135220   978 slave.cpp:2413] Executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002 has terminated with signal Killed
I0825 15:30:14.135356   978 slave.cpp:2552] Cleaning up executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135499   978 slave.cpp:2627] Cleaning up framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135627   976 status_update_manager.cpp:282] Closing status update streams for framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135571   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31/runs/37cc2b09-0e6d-4738-a837-7956367bba2b' for gc 6.99999843242074days in the future
I0825 15:30:14.135910   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31' for gc 6.99999843187556days in the future
I0825 15:30:14.135980   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002' for gc 6.99999843111111days in the future
I0825 15:31:04.450660   978 slave.cpp:2873] Current usage 60.67%. Max allowed age: 2.053113079446458days

Has anyone seen something similar?

stszievb (answer #1)

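The problem was not caused by a network connectivity issue but by the Mesos slave recovery policy, as described here: http://mesos.apache.org/documentation/latest/slave-recovery/
Initially I connected the slaves to the master and then disconnected them because of an unrelated problem. When I later tried to connect the slaves again, the master dropped them. Quoting the documentation linked above:
"A restarted slave should re-register with master within a timeout (currently 75 seconds). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using monit)."
I solved the problem by connecting the slaves with the --strict option set to false.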
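
For anyone who wants to try the same fix, here is a rough sketch of how the slave command line might be put together. The master address and the /tmp/mesos work directory are simply read off the logs in the question, and the slave_cmd helper is a hypothetical name of mine, so treat all of them as assumptions and adjust them for your own cluster. The function only prints the command so it can be checked before anything is started.

```shell
#!/bin/sh
# Sketch only: assemble a mesos-slave command line with strict recovery disabled.
# The defaults below are assumptions taken from the logs in the question.
MASTER="${MASTER:-192.168.0.241:5050}"
WORK_DIR="${WORK_DIR:-/tmp/mesos}"

# slave_cmd is a hypothetical helper name, not part of Mesos.
slave_cmd() {
  # --strict=false: errors during slave recovery are ignored instead of being fatal.
  printf 'mesos-slave --master=%s --work_dir=%s --strict=false\n' "$MASTER" "$WORK_DIR"
}

# Print the command for inspection instead of launching the slave directly.
slave_cmd
```

Once the printed line looks right, it can be run on each slave host, for example with eval "$(slave_cmd)". With --strict=false the slave recovers as much state as it can rather than failing on the first recovery error, so it is worth checking the slave log afterwards.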
