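What happened?
The readiness and liveness probes of a pod (container) are non-blocking routines. If the readiness probe fails, the liveness probe can still trigger a restart and possibly self-heal.
However, we ran into the following situation:
- The coredns pod started, but external automation removed the IP on the node. The CNI IPAM was forced to resync resource state, and the coredns pod's network namespace was torn down and rebuilt: the container ID changed, but the pod ID stayed the same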
Feb 28 16:33:53 ... kubelet.go:2456] "SyncLoop (PLEG): event for pod" pod="kube-system/coredns-f88c6698d-zmjgk" event={"ID":"001d42a0-1729-44fc-9959-b6e751ee44d9","Type":"ContainerStarted","Data":"9e324b5e26ac15640355f3dd86bcdb80f81f380827b435abc85365cd67fcc1f2"}
Feb 28 16:33:54 ... kubelet.go:2456] "SyncLoop (PLEG): event for pod" pod="kube-system/coredns-f88c6698d-zmjgk" event={"ID":"001d42a0-1729-44fc-9959-b6e751ee44d9","Type":"ContainerStarted","Data":"86d1a5bcff3978fcbaae12fc6259adade96747cc14c3aa3374102d41b34c1636"}
- coredns has no startup probe, so the container is treated as started and the readiness probe (doProbe) is sent
Feb 28 16:33:54 ... kubelet.go:2528] "SyncLoop (probe)" probe="readiness" status="" pod="kube-system/coredns-f88c6698d-zmjgk"
- This HTTP probe failed with HTTP status code 503, and no liveness probe was issued that could have triggered self-healing/a restart
Feb 28 16:33:54 ... prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/coredns-f88c6698d-zmjgk" podUID="001d42a0-1729-44fc-9959-b6e751ee44d9" containerName="coredns" probeResult="failure" output="HTTP probe failed with statuscode: 503"
Feb 28 16:33:56 ... prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/coredns-f88c6698d-zmjgk" podUID="001d42a0-1729-44fc-9959-b6e751ee44d9" containerName="coredns" probeResult="failure" output="HTTP probe failed with statuscode: 503"
Feb 29 00:18:22 ... prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/coredns-f88c6698d-zmjgk" podUID="001d42a0-1729-44fc-9959-b6e751ee44d9" containerName="coredns" probeResult="failure" output="HTTP probe failed with statuscode: 503"
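- It is just unclear why the liveness probe in the coredns spec was never sent
Is getWorker here updating the PodStatus after checking the startup probe, and not introducing an inadvertent wait on readiness?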
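To make the "container ID changed, pod ID stayed the same" observation easier to check on a larger log, the PLEG lines can be grouped by pod UID. This is a minimal sketch, assuming the kubelet log format shown above; the function name `started_ids_by_pod` and the regular expression are illustrative and not part of any Kubernetes tooling. It treats the `Data` field as an opaque ID and does not distinguish sandbox IDs from application container IDs.

```python
import json
import re
from collections import defaultdict

# Matches the pod name and the JSON event payload of a "SyncLoop (PLEG)" line.
PLEG_RE = re.compile(r'pod="(?P<pod>[^"]+)" event=(?P<event>\{.*\})')


def started_ids_by_pod(lines):
    """Map (pod name, pod UID) -> Data values of its ContainerStarted events."""
    started = defaultdict(list)
    for line in lines:
        match = PLEG_RE.search(line)
        if not match:
            continue  # not a PLEG event line
        event = json.loads(match.group("event"))
        if event.get("Type") == "ContainerStarted":
            started[(match.group("pod"), event["ID"])].append(event["Data"])
    return dict(started)


# The two PLEG lines quoted in this report.
sample = [
    'Feb 28 16:33:53 ... kubelet.go:2456] "SyncLoop (PLEG): event for pod" '
    'pod="kube-system/coredns-f88c6698d-zmjgk" '
    'event={"ID":"001d42a0-1729-44fc-9959-b6e751ee44d9","Type":"ContainerStarted",'
    '"Data":"9e324b5e26ac15640355f3dd86bcdb80f81f380827b435abc85365cd67fcc1f2"}',
    'Feb 28 16:33:54 ... kubelet.go:2456] "SyncLoop (PLEG): event for pod" '
    'pod="kube-system/coredns-f88c6698d-zmjgk" '
    'event={"ID":"001d42a0-1729-44fc-9959-b6e751ee44d9","Type":"ContainerStarted",'
    '"Data":"86d1a5bcff3978fcbaae12fc6259adade96747cc14c3aa3374102d41b34c1636"}',
]

for (pod, uid), ids in started_ids_by_pod(sample).items():
    if len(set(ids)) > 1:
        print(f"{pod} (UID {uid}): {len(set(ids))} distinct started IDs")
```

On the two lines above, the same pod UID appears with two different started IDs, which is consistent with the description in this report.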
What did you expect to happen?
- Kubernetes attempts to self-heal a pod whose probes are stuck or failing
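The expected behavior can be sketched as a toy model in which each probe keeps its own schedule and failure counter. This is only an illustration of the expectation, not kubelet's implementation: `ProbeSpec` and `simulate` are hypothetical names, time advances in whole seconds, and jitter and probe timeouts are ignored. The numbers are taken from the reproduction manifest in this report (liveness: initialDelaySeconds 60, periodSeconds 10, failureThreshold 5; readiness: periodSeconds 10, failureThreshold 3).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ProbeSpec:
    initial_delay: int      # initialDelaySeconds
    period: int             # periodSeconds
    failure_threshold: int  # failureThreshold


def due(spec: ProbeSpec, t: int) -> bool:
    """True if this probe would run at second t in the simplified model."""
    return t >= spec.initial_delay and (t - spec.initial_delay) % spec.period == 0


def simulate(readiness: ProbeSpec, liveness: ProbeSpec,
             healthy: Callable[[int], bool],
             horizon: int) -> Tuple[bool, Optional[int]]:
    """Return (ready, restart_at) after running both probes for `horizon` seconds.

    The two probes do not wait on each other: a failing readiness probe
    never delays or suppresses the liveness probe.
    """
    ready = False
    readiness_failures = 0
    liveness_failures = 0
    for t in range(horizon + 1):
        if due(readiness, t):
            if healthy(t):
                ready, readiness_failures = True, 0
            else:
                readiness_failures += 1
                if readiness_failures >= readiness.failure_threshold:
                    ready = False
        if due(liveness, t):
            if healthy(t):
                liveness_failures = 0
            else:
                liveness_failures += 1
                if liveness_failures >= liveness.failure_threshold:
                    return ready, t  # the container would be restarted here
    return ready, None


readiness = ProbeSpec(initial_delay=0, period=10, failure_threshold=3)
liveness = ProbeSpec(initial_delay=60, period=10, failure_threshold=5)

# An endpoint that never becomes healthy, as with the repeated 503 responses.
print(simulate(readiness, liveness, healthy=lambda t: False, horizon=300))
```

Under these assumptions the model reports a restart at second 100 (five consecutive liveness failures at 60, 70, 80, 90 and 100), even though the pod never became ready. The logs in this report instead show readiness failures continuing for hours with no restart.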
How can we reproduce it (as minimally and precisely as possible)?
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness
    args:
    - /server
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 60  # or 300
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  restartPolicy: Always
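- Create the test pod above
- After the startup and readiness probes (not important), but before the liveness probe, remove the IP assigned to the pod externally
- Force node IPAM to resync; similar errors appear: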
Mar 06 20:18:37 ... prober.go:107] "Probe failed" probeType="Readiness" pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 containerName="liveness" probeResult=failure output="HTTP probe failed with statuscode: 500"
Mar 06 20:18:47 ... prober.go:107] "Probe failed" probeType="Readiness" pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 containerName="liveness" probeResult=failure output="HTTP probe failed with statuscode: 500"
Mar 06 20:18:57 ... prober.go:107] "Probe failed" probeType="Readiness" pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 containerName="liveness" probeResult=failure output="HTTP probe failed with statuscode: 500"
Mar 06 20:19:07 ... prober.go:107] "Probe failed" probeType="Readiness" pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 containerName="liveness" probeResult=failure output="HTTP probe failed with statuscode: 500"
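A quick way to confirm that only readiness failures show up for a pod is to tally the `"Probe failed"` lines by pod and probe type. The following is a small sketch under the assumption that the kubelet log lines look like the ones quoted in this report; `tally_probe_failures` and the regular expression are illustrative names, not an existing tool.

```python
import re
from collections import Counter

# Matches kubelet prober lines such as:
#   prober.go:107] "Probe failed" probeType="Readiness" pod="ns/name" ...
PROBE_RE = re.compile(
    r'"Probe failed" probeType="(?P<type>\w+)" pod="(?P<pod>[^"]+)"'
)


def tally_probe_failures(lines):
    """Count failed probes per (pod, probe type)."""
    counts = Counter()
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            counts[(match.group("pod"), match.group("type"))] += 1
    return counts


# Two of the reproduction log lines from this report.
sample = [
    'Mar 06 20:18:37 ... prober.go:107] "Probe failed" probeType="Readiness" '
    'pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 '
    'containerName="liveness" probeResult=failure '
    'output="HTTP probe failed with statuscode: 500"',
    'Mar 06 20:18:47 ... prober.go:107] "Probe failed" probeType="Readiness" '
    'pod="gateway-ns/liveness-http" podUID=868de996-ff69-44bd-b73c-feaeb2234839 '
    'containerName="liveness" probeResult=failure '
    'output="HTTP probe failed with statuscode: 500"',
]

counts = tally_probe_failures(sample)
for (pod, probe_type), n in sorted(counts.items()):
    print(f"{pod}: {n} failed {probe_type} probe(s)")

pods_without_liveness = {pod for pod, _ in counts} - {
    pod for pod, probe_type in counts if probe_type == "Liveness"
}
print("pods with no logged Liveness failure:", sorted(pods_without_liveness))
```

A pod listed in the last line has no logged liveness failure. On its own that does not prove the liveness probe never ran, because successful probes may not be logged at the kubelet's default verbosity, so the result is a hint to investigate rather than a conclusion.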
Anything else we need to know?
- No response
Kubernetes version
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4...
Server Version: v1.29.1...-eks-...
Cloud provider
EKS
OS version
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ uname -a
Linux ....compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
6 answers
eufgjt7s1#
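This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Organization members can add the triage/accepted label by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.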
ycl3bljg2#
/sig Node
vxqlmq5t3#
@AbeOwlu , to reproduce this issue, how to remove the IP assigned to a pod externally and force node IPAM to re-sync? Is this an issue with AWS VPC CNI?
/triage needs-information
lp0sw83n4#
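Hi @AnishShah, thanks for looking into this issue.
Testing with
calicoctl ipam release --force
may produce a similar state... I should confirm this and update with more information in the near future.
3zwjbxry5#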
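The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90 days of inactivity, lifecycle/stale is applied
- After 30 days of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30 days of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale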
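ryoqjall6#
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90 days of inactivity, lifecycle/stale is applied
- After 30 days of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30 days of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten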