Kubernetes (GitLab runner) pods stuck in "Terminating" state

Posted by pobjuy32 on 2023-06-21 in Kubernetes

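I upgraded an EKS cluster to 1.23, and apparently since then none of the GitLab runner pods (Kubernetes executor) actually terminate; they stay in the "Terminating" state forever. If I delete them explicitly with kubectl, they do go away.
Some observations and facts: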

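  • The pods have no volumes attached.
  • K9s shows the containers in the pod in an "ERROR" state, but gives no indication as to *why*.
  • No finalizers are attached (the foregroundDeletion finalizer I see on terminating pods seems only to indicate that the pod is being deleted).
  • GitLab shows no errors in the pod logs or in the UI; the CI jobs complete successfully.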
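
The observations above (deletion pending, finalizers, container states) can be collected in one pass. The following is a minimal sketch, assuming the pod list was saved with something like `kubectl get pods -n gitlab -o json`; the helper name `find_stuck_pods` and the 300-second threshold are illustrative and not part of any tool mentioned here.

```python
from datetime import datetime, timezone


def find_stuck_pods(pod_list, now=None, min_age_seconds=300):
    """Return a summary of pods whose deletion has been pending too long.

    pod_list is the parsed JSON of `kubectl get pods -o json`.
    min_age_seconds is an arbitrary threshold chosen for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        deleted_at = meta.get("deletionTimestamp")
        if not deleted_at:
            continue  # the pod is not being deleted
        # Kubernetes serializes timestamps like "2023-03-08T09:37:34Z".
        ts = datetime.strptime(deleted_at, "%Y-%m-%dT%H:%M:%SZ")
        age = (now - ts.replace(tzinfo=timezone.utc)).total_seconds()
        if age < min_age_seconds:
            continue  # still within a normal termination window
        statuses = pod.get("status", {}).get("containerStatuses", [])
        stuck.append({
            "name": meta.get("name"),
            "node": pod.get("spec", {}).get("nodeName"),
            "finalizers": meta.get("finalizers", []),
            "terminating_for_s": int(age),
            # Each container state has one key: waiting, running or terminated.
            "containers": {
                cs["name"]: next(iter(cs.get("state", {})), "unknown")
                for cs in statuses
            },
        })
    return stuck
```

An empty `finalizers` list combined with a large `terminating_for_s` would match the situation described here: nothing blocks the deletion on the API side, yet the pod object never goes away.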

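Many discussions point to finalizers, but again: none are attached, and, as I understand it, simply removing the finalizers would make the pods finally terminate and disappear, which is not the behavior I am seeing. I also found this SO question, but it did not help me diagnose or solve the problem.
I really don't know where to look next. Apart from the GitLab runners, everything seems to be running fine, and their configuration has not changed.
Can anyone help? Many thanks.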

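**UPDATE:** I now realize that the pods are apparently gone on the underlying machines; at least there is no trace of them in docker ps. Overnight, two hosts were rotated, and a bunch of pods are still shown on those hosts (which no longer exist).

I am even more puzzled.

**UPDATE II:** I managed to grab the /var/log/messages log from the underlying host of a terminated container; see below.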

# /var/log/messages | grep ...
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.823617    3693 kubelet.go:2120] "SyncLoop ADD" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946041    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946155    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946217    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946241    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946279    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946307    3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047362    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047460    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047533    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047570    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047643    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047699    3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047815    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048095    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048277    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.050600    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.061048    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.062031    3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.159234    3693 kuberuntime_manager.go:487] "No sandbox for pod can be found. Need to start a new one" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar  8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.835414    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873037    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873611    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:98f1518e3d2f4ba7d66e1fd5238e49e928f81c5fe044e7bcbf408ec98ff7e45c}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926561    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926604    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.802812    3693 kubelet.go:2136] "SyncLoop DELETE" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.804708    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="build" containerID="docker://7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d" gracePeriod=1
Mar  8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.805281    3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="helper" containerID="docker://c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc" gracePeriod=1
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705701    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705852    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar  8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705876    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar  8 09:37:41 ip-10-0-136-6 kubelet: I0308 09:37:41.984526    3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7co
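
To make the container lifecycle in a log like the one above easier to read, the PLEG events can be pulled out into a short timeline. This is a minimal sketch, assuming the kubelet line format shown above; the regular expression and the helper names are written for this illustration only, and lines that do not match (including truncated ones) are skipped.

```python
import re

# Matches kubelet "SyncLoop (PLEG)" lines in the format shown in the log above.
PLEG_RE = re.compile(
    r'kubelet: \S+ (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s.*'
    r'"SyncLoop \(PLEG\): event for pod" pod="(?P<pod>[^"]+)" '
    r'event=&\{ID:(?P<uid>\S+) Type:(?P<type>\w+) Data:(?P<data>[0-9a-f]+)\}'
)


def pleg_events(lines):
    """Return (time, event type, short container ID) for each PLEG line."""
    events = []
    for line in lines:
        m = PLEG_RE.search(line)
        if m:
            events.append((m["time"], m["type"], m["data"][:12]))
    return events


def last_event_per_container(events):
    """Map each short container ID to the last event type seen for it."""
    last = {}
    for _time, kind, container_id in events:
        last[container_id] = kind
    return last
```

Feeding it the lines of /var/log/messages for a single pod gives one entry per container ID, which shows at a glance whether the kubelet saw every container die after the "SyncLoop DELETE".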

yh2wf1be  #1

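After upgrading Calico's tigera-operator to v3.26.0 (installed with the aws-cni as recommended by Amazon), I ran into exactly the same problem with the same symptoms on EKS 1.24.
By looking at the cluster's kube-state-metrics, I was able to identify the tigera-operator upgrade as the culprit, using the following query: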

kube_pod_status_phase{phase="Failed", namespace="gitlab-runner"}

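After the upgrade, the number of pods in the Failed phase rose sharply. If you have network policies enabled in your cluster via Calico, this may well be the cause.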
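
The same check can be scripted against the Prometheus HTTP API. This is a minimal sketch, assuming kube-state-metrics is scraped by a Prometheus server; the base URL in the usage note and the helper names are hypothetical. It only builds the instant-query URL and interprets an already-fetched JSON response, so fetching is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode

# The query from the answer above.
QUERY = 'kube_pod_status_phase{phase="Failed", namespace="gitlab-runner"}'


def instant_query_url(base_url, query=QUERY):
    """Build the URL of a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def failed_pods(response):
    """Return the sorted names of pods currently in the Failed phase.

    response is the parsed JSON body of an instant query. The metric is 1
    for the phase a pod is in and 0 otherwise, so only samples equal to 1
    are counted.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return sorted(
        sample["metric"].get("pod", "")
        for sample in response["data"]["result"]
        if float(sample["value"][1]) == 1.0
    )
```

For example, `instant_query_url("http://prometheus.example:9090")` gives the URL to request, and `len(failed_pods(body))` sampled before and after an operator upgrade shows whether the count jumps.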
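Downgrading the tigera-operator to v3.25.1 resolved the problem completely. Make sure to wait until the calico-node DaemonSet has been fully updated; no manual action is needed, because its pods are restarted one by one.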
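
Whether the calico-node DaemonSet has finished updating can be read from its status fields. This is a minimal sketch, assuming the object was saved with something like `kubectl get daemonset calico-node -n calico-system -o json` (the `calico-system` namespace is an assumption about the install); the helper name is illustrative. It checks roughly the conditions that `kubectl rollout status` waits for.

```python
def daemonset_rolled_out(ds):
    """Return True once every scheduled DaemonSet pod is updated and available.

    ds is the parsed JSON of a single DaemonSet object.
    """
    meta = ds.get("metadata", {})
    status = ds.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    return (
        # The controller has observed the latest spec.
        status.get("observedGeneration", 0) >= meta.get("generation", 0)
        # Every node that should run the pod runs the updated revision.
        and status.get("updatedNumberScheduled", 0) == desired
        # All of those pods are available.
        and status.get("numberAvailable", 0) == desired
    )
```

Only once this returns True is it worth judging whether the stuck runner pods have stopped appearing.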
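There is a related GitLab issue where the problem is analyzed further. Before identifying calico/tigera-operator as the culprit, I tried every minor (and some patch) version of the gitlab-runner chart from 0.52.1 down to 0.47.0 and even reviewed the changes, but that was definitely not the cause of the behavior. This appears to be an upstream Calico bug that is triggered by the way GitLab runner pods are terminated.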
