I use a Mesos 1.1.0 cluster for Spark batch and Spark Streaming jobs. The batch jobs run without any problems using the same Spark URIs, but the streaming jobs sometimes misbehave. The streaming jobs are started through Marathon 1.3.6, which runs a simple bash submit script. The script is local to every slave and identical on all of them.
The script:
spark-submit --class "$1" --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.executor.uri=/spark/spark-2.0.1-bin-hadoop2.6.tgz --conf spark.cores.max=2 --conf spark.eventLog.enabled=false /spark_jobs/"$2"
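For reference, the script takes the main class as its first argument and the jar file name (relative to /spark_jobs) as its second. A minimal sketch of the argument substitution, as a dry run that echoes the command instead of executing it (the class and jar names here are hypothetical):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the submit script: "$1" = main class, "$2" = jar in /spark_jobs.
# build_cmd echoes the spark-submit command line so the substitution can be inspected.
build_cmd() {
  echo spark-submit --class "$1" \
    --master mesos://zk://master1:2181,master2:2181,master3:2181/mesos \
    --conf spark.executor.uri=/spark/spark-2.0.1-bin-hadoop2.6.tgz \
    --conf spark.cores.max=2 \
    --conf spark.eventLog.enabled=false \
    /spark_jobs/"$2"
}

# Hypothetical invocation, e.g. from a Marathon app's cmd:
build_cmd com.example.StreamingJob streaming-job.jar
```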
Most of the time this works fine: Marathon deploys the app, the job starts, and it stays alive and working. Below is a standard sample from Mesos for a job that started correctly. I am only including the beginning of the log file, because I believe that is where the problem appears.
#################### Normal start
I1224 09:02:00.480056 23118 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/spark\/spark-2.0.1-bin-hadoop2.6.tgz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/frameworks\/cd6497c0-d374-4957-b861-286fd0b27587-13920\/executors\/0\/runs\/feb948b2-1b37-4962-a682-77aeb45a3063","user":"root"}
I1224 09:02:00.483033 23118 fetcher.cpp:409] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483050 23118 fetcher.cpp:250] Fetching directly into the sandbox directory
I1224 09:02:00.483072 23118 fetcher.cpp:187] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483088 23118 fetcher.cpp:167] Copying resource with command:cp '/spark/spark-2.0.1-bin-hadoop2.6.tgz' '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.675047 23118 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063' -xf '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.797693 23118 fetcher.cpp:92] Extracted '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz' into '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063'
I1224 09:02:02.797751 23118 fetcher.cpp:547] Fetched '/spark/spark-2.0.1-bin-hadoop2.6.tgz' to '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.850260 23144 exec.cpp:162] Version: 1.1.0
I1224 09:02:02.852879 23147 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/24 09:02:03 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 23155@mr02
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for TERM
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for HUP
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for INT
16/12/24 09:02:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/24 09:02:04 INFO SecurityManager: Changing view acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing view acls groups to:
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls groups to:
16/12/24 09:02:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/24 09:02:04 INFO TransportClientFactory: Successfully created connection to /10.210.197.151:57731 after 82 ms (0 ms spent in bootstraps)
Sometimes, when we restart this job, it gets initialized, it registers as a framework in Mesos, and it starts the Spark web UI, but then it does nothing: all tasks stay queued and it is doing something strange (computing stages of what?). Sometimes it takes multiple restarts to get it working, which is not the desired behavior, since the process is supposed to be automated.
#################### Abnormal start
I1224 08:32:11.346411 19741 exec.cpp:162] Version: 1.1.0
I1224 08:32:11.348379 19743 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@730: Client environment:host.name=mr02
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@738: Client environment:os.arch=3.13.0-100-generic
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@739: Client environment:os.version=#147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@767: Client environment:user.dir=/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/aee5cbd3-b3e8-431d-bc18-fb5107d54074-0000/executors/jobs_storagetraffic17.76d13635-c9b3-11e6-bf95-bc764e104d04/runs/cd1a62ea-4cda-4599-a6fc-e7c5d1ad3956
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mr01:2181,mr02:2181,mr03:2181 sessionTimeout=10000 watcher=0x7f8872f72200 sessionId=0 sessionPasswd=<null> context=0x7f883c000b10 flags=0
I1224 08:32:13.839673 19843 sched.cpp:226] Version: 1.1.0
2016-12-24 08:32:13,840:19753(0x7f88577fe700):ZOO_INFO@check_events@1728: initiated connection to server [10.210.224.189:2181]
2016-12-24 08:32:13,845:19753(0x7f88577fe700):ZOO_INFO@check_events@1775: session establishment complete on server [10.210.224.189:2181], sessionId=0x3592d18d17a001a, negotiated timeout=10000
I1224 08:32:13.845521 19838 group.cpp:340] Group process (zookeeper-group(1)@10.210.198.20:43514) connected to ZooKeeper
I1224 08:32:13.845562 19838 group.cpp:828] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1224 08:32:13.845579 19838 group.cpp:418] Trying to create path '/mesos' in ZooKeeper
I1224 08:32:13.846849 19838 detector.cpp:152] Detected a new leader: (id='27')
I1224 08:32:13.846951 19838 group.cpp:697] Trying to get '/mesos/json.info_0000000027' in ZooKeeper
I1224 08:32:13.847759 19841 zookeeper.cpp:259] A new leading master (UPID=master@10.210.224.189:5050) is detected
I1224 08:32:13.847839 19838 sched.cpp:330] New master detected at master@10.210.224.189:5050
I1224 08:32:13.848011 19838 sched.cpp:341] No credentials provided. Attempting to register without authentication
I1224 08:32:13.850549 19841 sched.cpp:743] Framework registered with cd6497c0-d374-4957-b861-286fd0b27587-13914
16/12/24 08:32:14 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property group.id is overridden to
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to
16/12/24 08:32:20 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property group.id is overridden to
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to
[Stage 0:> (0 + 0) / 31]
[Stage 0:> (0 + 1) / 31]
[Stage 0:> (0 + 2) / 31]
[Stage 0:=> (1 + 3) / 31]
[Stage 0:==================================> (19 + 0) / 31]
[Stage 1:==================================> (19 + 0) / 31]
[Stage 2:==================================> (19 + 0) / 31]
[Stage 3:==================================> (19 + 0) / 31]
[Stage 4:==================================> (19 + 0) / 31]
[Stage 5:==================================> (19 + 0) / 31]
[Stage 6:==================================> (19 + 0) / 31]
[Stage 7:==================================> (19 + 0) / 31]
[Stage 8:==================================> (19 + 0) / 31]
[Stage 9:==================================> (19 + 0) / 31]
[Stage 10:==================================> (19 + 0) / 31]
[Stage 11:==================================> (19 + 0) / 31]
[Stage 12:==================================> (19 + 0) / 31]
[Stage 13:==================================> (19 + 0) / 31]
[Stage 13:==================================> (19 + 2) / 31]
[Stage 14:==================================> (19 + 0) / 31]
[Stage 14:=====================================> (21 + 2) / 31]
[Stage 15:==================================> (19 + 0) / 31]
[Stage 15:=====================================> (21 + 2) / 31]
[Stage 16:==================================> (19 + 0) / 31]
[Stage 16:=========================================> (23 + 2) / 31]
[Stage 17:==================================> (19 + 0) / 31]
[Stage 17:=============================================> (25 + 2) / 31]
[Stage 18:==================================> (19 + 0) / 31]
[Stage 18:=============================================> (25 + 2) / 31]
[Stage 19:==================================> (19 + 0) / 31]
[Stage 19:=============================================> (25 + 2) / 31]
[Stage 20:==================================> (19 + 0) / 31]
[Stage 20:===========================================> (24 + 2) / 31]
[Stage 21:==================================> (19 + 0) / 31]
[Stage 22:==================================> (19 + 0) / 31]
[Stage 23:==================================> (19 + 0) / 31]
[Stage 24:==================================> (19 + 0) / 31]
[Stage 25:==================================> (19 + 0) / 31]
[Stage 26:==================================> (19 + 0) / 31]
[Stage 27:==================================> (19 + 0) / 31]
[Stage 28:==================================> (19 + 0) / 31]
[Stage 29:==================================> (19 + 0) / 31]
[Stage 30:==================================> (19 + 0) / 31]
[Stage 31:==================================> (19 + 0) / 31]
[Stage 32:==================================> (19 + 0) / 31]
[Stage 33:==================================> (19 + 0) / 31]
[Stage 34:==================================> (19 + 0) / 31]
[Stage 35:==================================> (19 + 0) / 31]
[Stage 36:==================================> (19 + 0) / 31]
[Stage 37:==================================> (19 + 0) / 31]
[Stage 38:==================================> (19 + 0) / 31]
[Stage 39:==================================> (19 + 0) / 31]
I suspect the problem lies in the first few lines of the log file. For some reason, the abnormal jobs have no fetcher section: it seems /spark/spark-2.0.1-bin-hadoop2.6.tgz is never copied, so there are no binaries to start from... but then how does it manage to start the Spark UI?
Does anyone have an idea how to fix this, where to look for additional clues, or what to test to get more data?
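One way to verify this suspicion would be to check whether the sandbox of an abnormal run actually contains the extracted Spark distribution. A minimal sketch (the helper name is mine; the sandbox path layout is taken from the normal-start log above, i.e. /var/lib/mesos/slaves/&lt;agent&gt;/frameworks/&lt;framework&gt;/executors/&lt;executor&gt;/runs/&lt;run&gt;):

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: returns success if the given executor sandbox
# contains the fetched/extracted Spark 2.0.1 distribution.
sandbox_has_spark() {
  local sandbox="$1"
  ls "$sandbox" 2>/dev/null | grep -q '^spark-2\.0\.1'
}
```

Running this against the run directory of an abnormal job (taken from the Mesos UI) would show whether the fetcher was skipped entirely or merely not logged.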
Thanks,
Ivan