kubernetes: A pod with a still-running postStart lifecycle hook that is deleted is not terminated, even after terminationGracePeriod

Posted by ggazkfy8 5 months ago in Kubernetes

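What happened?
We introduced a postStart hook for one of our pods. Later, when we tried to delete these pods, we noticed that they hung in the "Terminating" state indefinitely.
Running kubectl logs returned Error from server (BadRequest): container "reproducer" in pod "reproducer" is waiting to start: ContainerCreating. This was the first hint that the postStart hook was the problem (because a pod is considered PodInitializing until its postStart hook exits).
We could exec into the container normally, and there we found that our postStart hook was indeed stuck. Killing all processes of the postStart hook made the pod terminate.
The logs of our PID 1 process (which, luckily, we also write to a file) showed no sign that SIGTERM had been received (the application has a shutdown hook that logs when it is invoked).
From these observations we conclude that K8s does not send SIGTERM to the container's PID 1 process while the pod has not yet finished its postStart hook. Worse, no kill happens even after terminationGracePeriod expires, which leads to the observed behavior.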
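Given that conclusion, one possible way to keep a hook from blocking termination is to bound its wait. The sketch below is not part of the original report: `wait_for_file` and its parameters are hypothetical names, and it assumes a hook that polls for a marker file, as the reproducer in this report does. It also assumes a non-zero hook exit is acceptable for the workload, since Kubernetes documents that a failed postStart hook causes the container to be killed.

```bash
#!/usr/bin/env bash
# Sketch only: wait_for_file is a hypothetical helper, not part of the report.
#
# wait_for_file PATH TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls until PATH exists. Returns 0 once the file is present and 1 once
# at least TIMEOUT_SECONDS have elapsed without it appearing.
wait_for_file() {
  local path="$1" timeout="$2" interval="${3:-5}"
  local waited=0
  while [ ! -f "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

A hook could then call something like wait_for_file /postStartDone 60 before logging its completion message. The timeout value is a judgment call: it should be long enough for the awaited condition to occur in normal operation.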
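What did you expect to happen?
The pod terminates: if not by sending SIGTERM to the PID 1 process, then at least by a kill after terminationGracePeriod expires.
How can we reproduce it (as minimally and precisely as possible)?
Apply the following pod spec: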

apiVersion: v1
kind: Pod
metadata:
  name: reproducer
spec:
  containers:
    - command:
        - /bin/bash
        - -c
        - |
          trap "echo SIGTERM received, setting exit file | tee -a /log.txt; touch /exit" SIGTERM;
          while [ ! -f "/exit" ]; do
            echo "$(date) Faking work" | tee -a /log.txt;
            sleep 10;
          done

      image: ubuntu
      imagePullPolicy: IfNotPresent
      lifecycle:
        postStart:
          exec:
            command:
              - /bin/bash
              - -c
              - | 
                while [ ! -f "/postStartDone" ]; do 
                  sleep 5;  
                done; 
                echo "poststart hook done" | tee -a /log.txt

      name: reproducer
  terminationGracePeriodSeconds: 30

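(Note that we explicitly set terminationGracePeriodSeconds to its default of 30 for clarity.)
In one terminal, run watch -d kubectl get pods.
After the spec is applied, the pod appears, but because the postStart hook never exits, the pod stays in ContainerCreating.
Confirm that you cannot view logs with kubectl logs reproducer (error message: Error from server (BadRequest): container "reproducer" in pod "reproducer" is waiting to start: ContainerCreating).
Exec into the pod with kubectl exec -it reproducer -- /bin/bash.
Confirm with ps aux that both the root process and the postStart hook are running: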

root@reproducer:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4360  3320 ?        Ss   09:55   0:00 /bin/bash -c trap "echo SIGTERM received, setting exit file | tee -a /log.txt; touch /exit" SIGTERM; while [ ! -f "/exit"
root          11  0.0  0.0   4360  3364 ?        Ss   09:55   0:00 /bin/bash -c while [ ! -f "/postStartDone" ]; do    sleep 5;   done;  echo "poststart hook done" | tee -a /log.txt
root          19  0.0  0.0   4628  3796 pts/0    Ss   09:55   0:00 /bin/bash
root          31  0.0  0.0   2788  1096 ?        S    09:55   0:00 sleep 10
root          32  0.0  0.0   2788  1108 ?        S    09:55   0:00 sleep 5
root          33  0.0  0.0   7060  1592 pts/0    R+   09:55   0:00 ps aux

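Exit.
Delete the pod with kubectl delete pod reproducer.
Observe that the pod goes from ContainerCreating to Terminating and stays in that state indefinitely (even after the 30-second termination grace period has passed).
Exec into the pod again and run tail -f log.txt to watch the log output. You should see only Faking work lines.
In another shell, create the file that lets the postStart hook exit successfully: kubectl exec reproducer -- /bin/touch /postStartDone
poststart hook done is logged, and the pod finally terminates.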
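The deletion and observation steps above can be approximated with a small script. This is a sketch rather than part of the original report: `check_stuck` is a hypothetical helper name, and it assumes kubectl is configured for the target cluster and that the reproducer pod has already been applied. It passes --wait=false to kubectl delete so the command returns immediately instead of blocking on the stuck pod.

```bash
#!/usr/bin/env bash
# Sketch only: check_stuck is a hypothetical helper, not part of the report.
#
# check_stuck POD GRACE_SECONDS [MARGIN_SECONDS]
# Requests deletion without blocking, waits for the grace period plus a
# safety margin, then prints "stuck" if the pod object still exists and
# "terminated" if it is gone. Returns 2 if the delete request itself fails.
check_stuck() {
  local pod="$1" grace="$2" margin="${3:-10}"
  kubectl delete pod "$pod" --wait=false >/dev/null || return 2
  sleep "$((grace + margin))"
  if kubectl get pod "$pod" >/dev/null 2>&1; then
    echo "stuck"
  else
    echo "terminated"
  fi
}
```

For the pod spec in this report, check_stuck reproducer 30 would be expected to print stuck while the postStart hook is still running, matching the behavior described above.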
Anything else we need to know?
No response

Source: https://github.com/kubernetes/kubernetes/issues/116032

8 answers

00jrzges #1 (4 months ago)

/sig node

anauzrmj #2 (4 months ago)

Related issue: #113606.
The pod worker is unable to cancel the context.
kubernetes/pkg/kubelet/pod_workers.go
Line 776 in 7efa62d
status.cancelFn()
kubernetes/pkg/kubelet/kubelet.go
Lines 1606 to 1609 in 7efa62d
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
	// TODO(#113606): connect this with the incoming context parameter, which comes from the pod worker.
	// Currently, using that context causes test failures.
	ctx := context.TODO()

lyfkaqu1 #3 (4 months ago)

Resolving #113606 should fix this issue.

jdgnovmf #4 (4 months ago)

/cc

2skhul33 #5 (4 months ago)

/triage accepted
/assign @smarterclayton

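zqry0prt #6 (4 months ago)

This issue has not been updated in over a year and should be re-triaged.
You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted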

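laawzig2 #7 (4 months ago)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

After 90 days of inactivity, lifecycle/stale is applied
After 30 days of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30 days of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale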

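evrscar2 #8 (4 months ago)

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

After 90 days of inactivity, lifecycle/stale is applied
After 30 days of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30 days of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten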
