I upgraded an EKS cluster to 1.23 and, apparently ever since, none of the GitLab runner pods (Kubernetes executor) actually terminate; they stay in "Terminating" forever. If I delete them explicitly with kubectl, they do go away.
Some observations and facts:
- The pods have no volumes attached
- K9s shows the containers in the pods as being in an "ERROR" state, but gives no indication as to *why*
- No finalizers are attached (the only finalizer I see on terminating pods is `foregroundDeletion`, which as far as I understand merely indicates that the pod is being deleted)
- GitLab does not show any errors in the pod logs or in the UI; the CI jobs complete successfully
A lot of discussions point to finalizers, but again: none are attached, and, as I understand it, removing a finalizer is what would make a pod finally terminate and disappear, which is not the behavior I'm seeing. I also found this SO question, but it didn't help me diagnose or resolve the issue.
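For what it's worth, checking for finalizers and deleting a stuck pod looks roughly like this (the pod name is a placeholder; `--grace-period=0 --force` is only the heavier hammer if a plain delete ever hangs):

```sh
# Check whether any finalizers are set on a stuck pod (prints nothing if there are none)
kubectl -n gitlab get pod <runner-pod-name> -o jsonpath='{.metadata.finalizers}'

# Deleting the pod explicitly makes it go away, as described above
kubectl -n gitlab delete pod <runner-pod-name>
```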
I honestly don't know where else to look. Everything apart from the GitLab runners seems to be running fine, and their configuration has not changed.
Any pointers would be much appreciated. Many thanks.
**Update:** I've now realized that the pods apparently do disappear on the underlying machine; at least there is no trace of them in `docker ps`. Overnight, two hosts were rotated out, and a bunch of pods are still shown as running on those hosts (which no longer exist).
Even more puzzled now.
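As a rough illustration of the mismatch between what the API server reports and what the node actually runs (the pod name is taken from the logs below; the node-side check obviously only works while the host still exists):

```sh
# The API server still lists the pod as Terminating...
kubectl -n gitlab get pod runner-szk7coz8-project-12345678-concurrent-0dwfcq

# ...but the container runtime on the host itself has no record of it
docker ps -a | grep runner-szk7coz8
```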
**Update II:** I managed to grab the `/var/log/messages` log from the underlying host of a Terminated container, shown below.
# /var/log/messages | grep ...
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.823617 3693 kubelet.go:2120] "SyncLoop ADD" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946041 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946155 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946217 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946241 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946279 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:45 ip-10-0-136-6 kubelet: I0308 09:30:45.946307 3693 reconciler.go:238] "operationExecutor.VerifyControllerAttachedVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047362 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047460 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047533 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047570 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047643 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047699 3693 reconciler.go:293] "operationExecutor.MountVolume started for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.047815 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"scripts\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-scripts\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048095 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"repo\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-repo\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.048277 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"logs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-logs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.050600 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"docker-certs\" (UniqueName: \"kubernetes.io/empty-dir/08eb951f-8b11-44b2-b106-42719ecff1ba-docker-certs\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.061048 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-aws-iam-token\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.062031 3693 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"kube-api-access-2xzgj\" (UniqueName: \"kubernetes.io/projected/08eb951f-8b11-44b2-b106-42719ecff1ba-kube-api-access-2xzgj\") pod \"runner-szk7coz8-project-12345678-concurrent-0dwfcq\" (UID: \"08eb951f-8b11-44b2-b106-42719ecff1ba\") " pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.159234 3693 kuberuntime_manager.go:487] "No sandbox for pod can be found. Need to start a new one" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq"
Mar 8 09:30:46 ip-10-0-136-6 kubelet: I0308 09:30:46.835414 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar 8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873037 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar 8 09:30:47 ip-10-0-136-6 kubelet: I0308 09:30:47.873611 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:98f1518e3d2f4ba7d66e1fd5238e49e928f81c5fe044e7bcbf408ec98ff7e45c}
Mar 8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926561 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar 8 09:30:48 ip-10-0-136-6 kubelet: I0308 09:30:48.926604 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerStarted Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar 8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.802812 3693 kubelet.go:2136] "SyncLoop DELETE" source="api" pods=[gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq]
Mar 8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.804708 3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="build" containerID="docker://7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d" gracePeriod=1
Mar 8 09:37:34 ip-10-0-136-6 kubelet: I0308 09:37:34.805281 3693 kuberuntime_container.go:723] "Killing container with a grace period" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" podUID=08eb951f-8b11-44b2-b106-42719ecff1ba containerName="helper" containerID="docker://c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc" gracePeriod=1
Mar 8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705701 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:c3ffd5d9b1cf640b9af95efb271aeddc2ee1f96d84fa8e118d8ce778e33bbfdc}
Mar 8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705852 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:7a2c23f5a512d782af165b0a5d95dd55823202aef31cdc56998832a39b60017d}
Mar 8 09:37:36 ip-10-0-136-6 kubelet: I0308 09:37:36.705876 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7coz8-project-12345678-concurrent-0dwfcq" event=&{ID:08eb951f-8b11-44b2-b106-42719ecff1ba Type:ContainerDied Data:258f2671f11654204119270288364af695c332dc109c16e4eeb92ea8fed8a82c}
Mar 8 09:37:41 ip-10-0-136-6 kubelet: I0308 09:37:41.984526 3693 kubelet.go:2158] "SyncLoop (PLEG): event for pod" pod="gitlab/runner-szk7co
1 Answer
I ran into exactly the same problem, with the same symptoms, on EKS 1.24 after upgrading Calico's tigera-operator to v3.26.0 (installed with the aws-cni as recommended by Amazon).
By looking at the cluster's kube-state-metrics I was able to pin down the tigera-operator upgrade as the culprit, using the following query:
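(The original query did not survive in this copy. A kube-state-metrics query along these lines, run against Prometheus, surfaces the jump in failed pods; the metric name is standard kube-state-metrics, but the namespace filter and the Prometheus endpoint are assumptions.)

```sh
# Hypothetical reconstruction, not the author's exact query:
# count runner pods that kube-state-metrics reports as being in the Failed phase.
curl -sG "http://prometheus.monitoring.svc:9090/api/v1/query" \
  --data-urlencode 'query=sum(kube_pod_status_phase{namespace="gitlab", phase="Failed"})'
```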
After the upgrade, the number of pods in a failed state shot up. Assuming you have network policies enabled in the cluster via calico, this is likely your cause as well.
Downgrading the tigera-operator to `v3.25.1` resolved the problem completely (a sketch of the downgrade and the wait is below). Make sure to wait for the `calico-node` daemonset to be fully updated; its pods are restarted one by one and no manual intervention is needed. There is a related gitlab issue that analyzes the problem further. Before pinning down calico/tigera-operator as the culprit, I had tried every minor (and some patch) version of the gitlab-runner chart from 0.52.1 down to 0.47.0, and even reviewed the changes, but that was definitely not the cause of the behavior. It appears to be an upstream calico bug that is triggered by the way the gitlab runner pods are terminated.
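A rough sketch of the downgrade-and-wait sequence, assuming the operator was installed from the upstream manifest and that Calico runs in the default calico-system namespace (both are assumptions; if you installed the operator via the Helm chart, pin the chart to the matching version instead):

```sh
# Re-apply the tigera-operator manifest pinned to the known-good release (v3.25.1).
# The URL follows the upstream Calico release layout; verify it matches your install method.
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml

# Wait until the calico-node daemonset has fully rolled out on every node
kubectl -n calico-system rollout status daemonset/calico-node
```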