Spark on Mesos (DC/OS) loses tasks before doing anything

e0uiprwp · posted 2021-06-26 in Mesos

I am running into serious problems running DC/OS Apache Spark in Mesos cluster mode. I am using GCE instances to test DC/OS. Everything worked fine with Spark, and then it suddenly stopped working properly.

Before it even launches its executors, it reports that several executors were lost and blacklists the nodes, leaving very few resources, so the job ends up running only ~2-3 executors. It then goes on to remove non-existent executors, and I cannot tell whether that succeeds.
I would like to disable the blacklisting, but it is hard-coded in Spark's Mesos scheduler source, so it cannot be disabled or changed through any configuration.

This mainly happens when I limit the cores per executor, which makes the Spark driver create a few executors on a single node. Without that setting, the driver creates one executor per node; it still reports those executors as lost, but the nodes are not blacklisted, because with only one executor attempted per node the blacklist limit of 2 failed executors is never reached.
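To illustrate what I mean, here is a standalone Scala sketch of that behaviour. It is not the actual Spark source; the real logic lives in MesosCoarseGrainedSchedulerBackend, where the failure threshold is a hard-coded constant rather than a spark.* configuration key, which is why it cannot be turned off at submit time. The agent IDs and the threshold value here are assumptions for the sake of the example.

object BlacklistSketch {
  // Assumed to mirror the hard-coded limit in the Mesos scheduler backend; not configurable.
  val MaxSlaveFailures = 2

  def main(args: Array[String]): Unit = {
    val failuresPerAgent = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

    // Count a TASK_LOST against the agent and "blacklist" it once the limit is reached.
    def taskLost(agentId: String): Unit = {
      failuresPerAgent(agentId) += 1
      if (failuresPerAgent(agentId) >= MaxSlaveFailures)
        println(s"Blacklisting Mesos slave $agentId due to too many failures")
    }

    // Several executors per node (small cores-per-executor): one agent crosses the limit.
    Seq("S0", "S0", "S3", "S3", "S3").foreach(taskLost)
    // One executor per node: each agent fails at most once and is never blacklisted.
  }
}

With a small cores-per-executor value the driver places multiple executors on one agent, so a single bad agent accumulates two TASK_LOST events and gets blacklisted, which is what the log below shows.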
However, Spark dynamic allocation works perfectly fine, with no errors and no lost tasks.
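For reference, here is a minimal sketch, assuming standard Spark 2.2 configuration keys, of the two setups I am comparing: a fixed cores-per-executor limit (the case that triggers the blacklisting) versus dynamic allocation (which works fine for me). The application name and all values are illustrative only.

import org.apache.spark.sql.SparkSession

object SubmitConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UserEventsReader")
      // Fixed sizing: a small per-executor core limit means several executors per node.
      .config("spark.executor.cores", "2")
      .config("spark.cores.max", "12")
      // Dynamic allocation alternative (needs the external shuffle service on Mesos):
      // .config("spark.dynamicAllocation.enabled", "true")
      // .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    spark.stop()
  }
}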
Logs of the situation:

17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@131fcb6f{/SQL,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@878537d{/SQL/json,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c37bd27{/SQL/execution,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30331109{/SQL/execution/json,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4c27d39d{/static/sql,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S0 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S3 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 0 requested
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S3 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 3 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 1 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 5 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 4 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 2 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/10/04 06:13:55 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/10/04 06:13:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 6 is now TASK_RUNNING
2017-10-04 06:13:55,847:7(0x7f56dcb82700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 32ms
17/10/04 06:13:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 7 is now TASK_RUNNING

Because two almost-empty nodes get blacklisted, the job runs with fewer executors, using whatever is left in the cluster.
Unfortunately, I cannot provide debug logs or anything like that, because it seems impossible to use a custom log4j file (I followed the write-ups people have posted, and it does nothing).
The full log from the beginning (file fetching omitted). Am I missing something here?

I1004 09:19:27.268869  5311 exec.cpp:162] Version: 1.4.0
I1004 09:19:27.273120  5317 exec.cpp:237] Executor registered on agent e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S2
I1004 09:19:27.273859  5315 executor.cpp:120] Registered docker executor on 10.132.0.12
I1004 09:19:27.273959  5317 executor.cpp:160] Starting task driver-20171004091925-0004
Using bootstrap to set SPARK_LOCAL_IP
2017/10/04 09:19:27 Printing new task IP: 10.132.0.12
SPARK_LOCAL_IP = 10.132.0.12
spark-env: User: root
spark-env: No kerberos KDC config found
17/10/04 09:19:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/04 09:19:30 INFO SparkContext: Running Spark version 2.2.0
17/10/04 09:19:30 INFO SparkContext: Submitted application: UserEventsReader$
17/10/04 09:19:30 INFO SecurityManager: Changing view acls to: root
17/10/04 09:19:30 INFO SecurityManager: Changing modify acls to: root
17/10/04 09:19:30 INFO SecurityManager: Changing view acls groups to: 
17/10/04 09:19:30 INFO SecurityManager: Changing modify acls groups to: 
17/10/04 09:19:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/10/04 09:19:30 INFO Utils: Successfully started service 'sparkDriver' on port 45577.
17/10/04 09:19:30 INFO SparkEnv: Registering MapOutputTracker
17/10/04 09:19:31 INFO SparkEnv: Registering BlockManagerMaster
17/10/04 09:19:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/10/04 09:19:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/10/04 09:19:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a1289f49-471f-4b35-9f23-7b3ba2ad1b90
17/10/04 09:19:31 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/10/04 09:19:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/10/04 09:19:31 INFO log: Logging initialized @3693ms
17/10/04 09:19:31 INFO Server: jetty-9.3.z-SNAPSHOT
17/10/04 09:19:31 INFO Server: Started @3881ms
17/10/04 09:19:31 INFO AbstractConnector: Started ServerConnector@5707c1cb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/10/04 09:19:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@fac80{/jobs,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@48e64352{/jobs/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4362d7df{/jobs/job,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@200606de{/jobs/job/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@f8908f6{/stages,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ef8a8c3{/stages/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@63fd4873{/stages/stage,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7957dc72{/stages/stage/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3aacf32a{/stages/pool,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@82c57b3{/stages/pool/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@600b0b7{/storage,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5ea502e0{/storage/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@473b3b7a{/storage/rdd,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@77b7ffa4{/storage/rdd/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@402f80f5{/environment,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@133e019b{/environment/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7dac3fd8{/executors,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2102a4d5{/executors/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d4d3fe7{/executors/threadDump,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@51684e4a{/executors/threadDump/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@38875e7d{/static,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a6f5124{/,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@ec2bf82{/api,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@669253b7{/jobs/job/kill,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@51a06cbe{/stages/stage/kill,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO SparkUI: Bound SparkUI to 10.132.0.12, and started at http://10.132.0.12:4040
17/10/04 09:19:31 INFO SparkContext: Added JAR file:/mnt/mesos/sandbox/SparkWindow-1.0.jar at spark://10.132.0.12:45577/jars/SparkWindow-1.0.jar with timestamp 1507108771967
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@730: Client environment:host.name=private-2
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-514.26.2.el7.x86_64
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Jul 4 15:04:05 UTC 2017
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@767: Client environment:user.dir=/mnt/mesos/sandbox
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=master.mesos:2181 sessionTimeout=10000 watcher=0x7f906202f4d0 sessionId=0 sessionPasswd=<null> context=0x7f909400bed0 flags=0
2017-10-04 09:19:32,658:7(0x7f90539de700):ZOO_INFO@check_events@1728: initiated connection to server [10.132.0.10:2181]
I1004 09:19:32.658676    93 sched.cpp:232] Version: 1.4.0
2017-10-04 09:19:32,663:7(0x7f90539de700):ZOO_INFO@check_events@1775: session establishment complete on server [10.132.0.10:2181], sessionId=0x15ee68b946c0063, negotiated timeout=10000
I1004 09:19:32.664188    87 group.cpp:341] Group process (zookeeper-group(1)@10.132.0.12:34557) connected to ZooKeeper
I1004 09:19:32.664244    87 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1004 09:19:32.664261    87 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
I1004 09:19:32.665998    87 detector.cpp:152] Detected a new leader: (id='32')
I1004 09:19:32.666154    87 group.cpp:700] Trying to get '/mesos/json.info_0000000032' in ZooKeeper
I1004 09:19:32.670025    88 zookeeper.cpp:262] A new leading master (UPID=master@10.132.0.10:5050) is detected
I1004 09:19:32.670140    88 sched.cpp:336] New master detected at master@10.132.0.10:5050
I1004 09:19:32.670532    88 sched.cpp:352] No credentials provided. Attempting to register without authentication
I1004 09:19:32.676714    85 sched.cpp:759] Framework registered with ed329c1d-e021-4b11-a9c3-c9bf02c26921-0000-driver-20171004091925-0004
17/10/04 09:19:32 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40838.
17/10/04 09:19:32 INFO NettyBlockTransferService: Server created on 10.132.0.12:40838
17/10/04 09:19:32 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/10/04 09:19:32 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.132.0.12:40838 with 912.3 MB RAM, BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@66bfd864{/metrics/json,null,AVAILABLE,@Spark}
17/10/04 09:19:33 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/10/04 09:19:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/mnt/mesos/sandbox/spark-warehouse').
17/10/04 09:19:34 INFO SharedState: Warehouse path is 'file:/mnt/mesos/sandbox/spark-warehouse'.
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@878537d{/SQL,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@29fc1a2b{/SQL/json,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30331109{/SQL/execution,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@74fe5966{/SQL/execution/json,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7bde1f3a{/static/sql,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S2 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 6 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S0 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S0 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 0 requested
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 4 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 1 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 6 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 6
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 5 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 2 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 3 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
2017-10-04 09:19:36,041:7(0x7f90539de700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 43ms
17/10/04 09:19:36 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 7 is now TASK_RUNNING
17/10/04 09:19:36 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/10/04 09:19:36 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 8 is now TASK_RUNNING
17/10/04 09:19:36 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it

No answers yet.
