Kubernetes restarts the Spark pod every hour

Posted by kpbpu008 on 2021-05-27 in Spark

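I have deployed a Spark application in cluster mode on Kubernetes. The Spark application pod restarts almost every hour. Before each restart, the driver log shows the following messages: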

20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.

The executor log shows:

20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called

How can I find out what is causing the executors to be deleted?

Deployment:

Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:       app=test
                chart=test-2.0.0
                heritage=Tiller
                product=testp
                release=test
                service=test-spark
  Containers:
   test-spark:
    Image:     test-spark:2df66df06c
    Port:       <none>
    Host Port:  <none>
    Command:
      /spark/bin/start-spark.sh
    Args:
      while true; do sleep 30; done;
    Limits:
      memory:  4Gi
    Requests:
      memory:  4Gi
    Liveness:  exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
    Environment:
      JVM_ARGS:                             -Xms256m -Xmx1g
      KUBERNETES_MASTER:                    https://kubernetes.default.svc
      KUBERNETES_NAMESPACE:                 test-spark
      IMAGE_PULL_POLICY:                    Always
      DRIVER_CPU:                           1
      DRIVER_MEMORY:                        2048m
      EXECUTOR_CPU:                         1
      EXECUTOR_MEMORY:                      2048m
      EXECUTOR_INSTANCES:                   2
      KAFKA_ADVERTISED_HOST_NAME:           kafka.default:9092
      ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS:  test-events
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-spark-5c5997b459 (1/1 replicas created)
Events:          <none>

0mkxixxg #1

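I did some quick research on running Spark on Kubernetes, and it seems that Spark, by design, terminates the executor pods once the Spark application has finished running. Quoting the official Spark documentation:

"When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up."

So I believe you do not need to worry about the restarts, as long as your Spark instance can still launch executor pods when it needs them.

Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works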
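
One way to check whether the executor losses really recur about once an hour, rather than coinciding with the application completing, is to pull the "Lost executor" lines out of the driver log and look at the time between them. The following is only a minimal sketch: it assumes the log uses the same timestamp layout as the lines in the question (yy/MM/dd HH:mm:ss), and the function names `lost_executor_events` and `gaps_in_minutes` are made up for illustration.

```python
import re
from datetime import datetime

# Matches driver-log lines of the form shown in the question, e.g.
# 20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: <reason>
LOST_EXECUTOR = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR TaskSchedulerImpl: "
    r"Lost executor (\S+) on (\S+): (.*)$"
)


def lost_executor_events(lines):
    """Return (timestamp, executor_id, host, reason) for each lost-executor line."""
    events = []
    for line in lines:
        match = LOST_EXECUTOR.match(line.strip())
        if match:
            # Assumption: timestamps are yy/MM/dd HH:mm:ss, as in the question.
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            events.append((stamp, match.group(2), match.group(3), match.group(4)))
    return events


def gaps_in_minutes(events):
    """Minutes between consecutive distinct timestamps at which executors were lost."""
    stamps = sorted({stamp for stamp, _, _, _ in events})
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(stamps, stamps[1:])]


# Lines copied from the driver log in the question.
sample = [
    "20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: "
    "The executor with id 1 was deleted by a user or the framework.",
    "20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: "
    "The executor with id 2 was deleted by a user or the framework.",
    "20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)",
]

events = lost_executor_events(sample)
for stamp, executor_id, host, reason in events:
    print(stamp.isoformat(), executor_id, host, reason)
print(gaps_in_minutes(events))
```

Run over a full driver log (for example the output of `kubectl logs` for the driver pod), gaps close to 60 minutes would point to something periodic, while a single loss at the end of a job would fit the behaviour described in the documentation quoted above.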


mdfafbf1 #2

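I don't know how you configured your application pod, but you can use the following setting to stop the pod from restarting. Include it in your deployment YAML file; the pod will then never be restarted, and you can carry on debugging it.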

restartPolicy: Never
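
Where this field goes depends on the kind of object. The pod template of a Deployment only accepts restartPolicy: Always, so the sketch below assumes a standalone Pod created just for debugging. It builds the manifest as a Python dict and prints it as JSON, which kubectl accepts in the same way as YAML. The pod name `test-spark-debug` and the helper `debug_pod_manifest` are hypothetical; the image and command are taken from the deployment shown in the question.

```python
import json


def debug_pod_manifest(name, image, command):
    """Build a bare Pod manifest whose container is not restarted after it exits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # restartPolicy is a pod-level field, a sibling of "containers".
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "command": command},
            ],
        },
    }


manifest = debug_pod_manifest(
    name="test-spark-debug",                # hypothetical name
    image="test-spark:2df66df06c",          # image from the question
    command=["/spark/bin/start-spark.sh"],  # command from the question
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`. Once the container exits, the pod stays in a terminated state, so its logs and exit status remain available for inspection.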
