spark不断地将任务提交给死掉的执行者

nimxete2  于 2021-05-24  发布在  Spark
关注(0)|答案(0)|浏览(719)

我正在开发apachespark,我面临一个非常奇怪的问题。其中一个执行器失败是因为oom和它的关闭钩子清除了所有存储(内存和磁盘),但是很明显,驱动程序由于处理本地任务而一直在同一个执行器上提交失败的任务。
现在,该计算机上的存储已清除,所有重试的任务也会失败,从而导致整个阶段失败(在4次重试之后)
我不明白的是,驱动程序怎么可能不知道执行程序处于关闭状态,无法执行任何任务。
配置:
心跳间隔:60秒
网络超时:600s
日志以确认executor正在接受关机后的任务

20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] Executor: Exception in task 6391.0 in stage 17.0 (TID 138513)
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 138513,5,main]
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 INFO [pool-8-thread-1] DiskBlockManager: Shutdown hook called
20/09/29 20:26:35 ERROR [Executor task launch worker for task 138295] Executor: Exception in task 6239.0 in stage 17.0 (TID 138295)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/35/temp_shuffle_b5df90ac-78de-48e3-9c2d-891f8b2ce1fa (No such file or directory)
20/09/29 20:26:36 ERROR [Executor task launch worker for task 139484] Executor: Exception in task 6587.0 in stage 17.0 (TID 139484)
org.apache.spark.SparkException: Block rdd_3861_6587 was not found even though it's read-locked
20/09/29 20:26:42 WARN [Thread-2] ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_172]
20/09/29 20:26:44 ERROR [Executor task launch worker for task 140256] Executor: Exception in task 6576.3 in stage 17.0 (TID 140256)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/30/rdd_3861_6576 (No such file or directory)
20/09/29 20:26:44 INFO [dispatcher-event-loop-0] Executor: Executor is trying to kill task 6866.1 in stage 17.0 (TID 140329), reason: stage cancelled
20/09/29 20:26:47 INFO [pool-8-thread-1] ShutdownHookManager: Shutdown hook called
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: removing client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: Stopping client
20/09/29 20:26:47 DEBUG [Thread-2] ShutdownHookManager: ShutdownHookManger complete shutdown
20/09/29 20:26:55 INFO [dispatcher-event-loop-14] CoarseGrainedExecutorBackend: Got assigned task 141510
20/09/29 20:26:55 INFO [Executor task launch worker for task 141510] Executor: Running task 545.1 in stage 26.0 (TID 141510)

(我已经删减了堆栈跟踪,因为它们只是spark rdd shuffle read方法)
如果我们检查时间戳,那么关机开始于 20/09/29 20:26:32 结束于 20/09/29 20:26:47 ,在此期间,驱动程序将所有重试的任务发送到同一执行器,但都失败,导致阶段取消。
有人能帮我理解这种行为吗?如果还需要什么,请告诉我

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题