Flink does not start a TaskManager on Kubernetes: job already reached a globally-terminal state

Posted by jm81lzqq on 2023-01-28 in Apache

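I have deployed a Flink cluster to Kubernetes, but I only see the JobManager running.
I was running Flink on a different Kubernetes cluster and created a savepoint through the FlinkDeployment resource of the Flink Operator. The savepoint was saved correctly. I then deployed the Flink application to the new Kubernetes cluster and patched the savepoint location path in the FlinkDeployment.
The Flink pod now logs this error: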

WARN  org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Ignoring JobGraph submission 'Windchill ESI Post Processing' because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
...
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update ConfigMapLock
...
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.0.0.1/api/v1/namespaces/post-processing-int2/configmaps/post-processing-cluster-config-map. Message: Operation cannot be fulfilled on configmaps "post-processing-cluster-config-map": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=configmaps, name=post-processing-cluster-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on configmaps "post-processing-cluster-config-map": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).

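The ConfigMap mentioned in the error does exist.
My question is: how do I start a new TaskManager now? I have set numberOfTaskSlots: 4. I tried opening a shell in the JobManager pod and running bin/taskmanager.sh start, but that only started a process inside that pod, which did not seem right to me, so I stopped it.
I expect to see a new TaskManager pod start up. Thanks.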

r8uurelv (answer #1)

The clue is in the first line of the log:

WARN  org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Ignoring JobGraph submission 'Windchill ESI Post Processing' because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution

My mistake started with this command:

kubectl patch flinkdeployment/<name-of-flink-deployment> --type=merge -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}'

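The problem is upgradeMode. It should not be edited; it should be left as last-state. With last-state, the HA state (in my case, state stored in Azure Blob Storage) tells the Flink deployment to resume from where it stopped. With savepoint, the deployment ends up in the FINISHED state, and no new TaskManager is started when it is deployed again.
Here is the correct edit: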
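Before suspending, it may help to check which upgradeMode the deployment currently has. The following is only a sketch under stated assumptions: the function names `current_upgrade_mode` and `is_last_state` are hypothetical helpers, the namespace argument is whatever namespace the FlinkDeployment lives in, and the field path `.spec.job.upgradeMode` is taken from the patch command quoted above.

```shell
#!/bin/sh

# Hypothetical helper: print the current spec.job.upgradeMode of a FlinkDeployment.
# $1 = FlinkDeployment name, $2 = namespace (both supplied by the caller).
current_upgrade_mode() {
    kubectl get flinkdeployment "$1" -n "$2" -o jsonpath='{.spec.job.upgradeMode}'
}

# Hypothetical helper: succeed only when the given mode is last-state,
# the value this answer recommends leaving in place.
is_last_state() {
    [ "$1" = "last-state" ]
}
```

A possible use is `is_last_state "$(current_upgrade_mode <name-of-flink-deployment> <namespace>)" || echo "upgradeMode was changed"`, which warns when the mode is anything other than last-state.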

kubectl patch flinkdeployment/<name-of-flink-deployment> --type=merge -p '{"spec": {"job": {"state": "suspended"}}}'
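To make it harder to change upgradeMode by accident again, the suspend patch could be wrapped in a small script. This is a minimal sketch, not an official operator tool: `suspend_patch` and `suspend_flinkdeployment` are hypothetical names, and the patch body is exactly the one from the command above.

```shell
#!/bin/sh

# Print the merge-patch body that only suspends the job.
# It deliberately contains no upgradeMode key, so the existing value is kept.
suspend_patch() {
    printf '%s' '{"spec": {"job": {"state": "suspended"}}}'
}

# Hypothetical wrapper: $1 = FlinkDeployment name.
# Refuses to call kubectl when no name is given.
suspend_flinkdeployment() {
    if [ -z "$1" ]; then
        echo "usage: suspend_flinkdeployment <name-of-flink-deployment>" >&2
        return 1
    fi
    kubectl patch "flinkdeployment/$1" --type=merge -p "$(suspend_patch)"
}
```

Because a merge patch only touches the keys it lists, leaving upgradeMode out of the body keeps whatever value the FlinkDeployment already has.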
