Kubernetes leader election failing, lease cannot be renewed automatically

yfjy0ee7 · posted on 2023-01-20 in Kubernetes
Follow (0) | Answers (1) | Views (361)

I have a production cluster currently running on K8s version 1.19.9 in which kube-scheduler and kube-controller-manager fail leader election. A leader can acquire the first lease, but it then cannot renew/re-acquire it. This keeps the other pods in a constant election loop: none of them holds the lease long enough to do anything meaningful, they time out, and another pod takes a new lease; this happens from node to node. Here are the logs:

E1201 22:15:54.818902       1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079       1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176       1 controllermanager.go:293] leaderelection lost

Detailed Docker logs:

Flag --port has been deprecated, see --secure-port instead.
I1201 22:14:10.374271       1 serving.go:331] Generated self-signed cert in-memory
I1201 22:14:10.735495       1 controllermanager.go:175] Version: v1.19.9+vmware.1
I1201 22:14:10.736289       1 dynamic_cafile_content.go:167] Starting request-header::/etc/kubernetes/pki/front-proxy-ca.crt
I1201 22:14:10.736302       1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
I1201 22:14:10.736684       1 secure_serving.go:197] Serving securely on 0.0.0.0:10257
I1201 22:14:10.736747       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/kube-controller-manager...
I1201 22:14:10.736868       1 tlsconfig.go:240] Starting DynamicServingCertificateController
E1201 22:14:20.737137       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:32.803658       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:44.842075       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:15:13.386932       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: context deadline exceeded
I1201 22:15:44.818571       1 leaderelection.go:253] successfully acquired lease kube-system/kube-controller-manager
I1201 22:15:44.818755       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Endpoints" apiVersion="v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
I1201 22:15:44.818790       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
E1201 22:15:54.818902       1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079       1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176       1 controllermanager.go:293] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc000fb20d0, 0x4c, 0xc6)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x6a57fa0, 0xc000000003, 0x0, 0x0, 0xc000472070, 0x68d5705, 0x14, 0x125, 0x0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:945 +0x191

My duct-tape recovery method is to shut down the other candidates and disable leader election with --leader-elect=false. We manually designate a single leader and let it run for a while, then re-enable leader election afterwards. This seems to get things working as expected again, and the lease is renewed normally from then on.
Could it be that the API server is too overwhelmed to spare the resources for this, given that the election fails with timeouts? Has anyone run into an issue like this?

5n0oy7gb 1#

@janeosaka, you are right. This issue occurs when you have (1) a resource crunch or (2) a network issue.
It seems the kube-apiserver is under resource pressure, which increases the latency of API calls and makes the leader-election calls time out.

**1) Resource crunch:** increase the CPU and memory of the nodes.

This seems to be the expected behavior. When leader election fails, the controller cannot renew its lease and, by design, it restarts itself so that only one controller is ever active at a time.
LeaseDuration and RenewDeadline (RenewDeadline is how long the acting master retries renewing before giving up) are configurable in controller-runtime; see the sketch below.
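As an illustration, here is a minimal sketch using client-go's leaderelection package directly (controller-runtime builds on the same machinery; this is not the kube-controller-manager source). The lock name is hypothetical, and the timings shown are the kube-controller-manager defaults, which map to its --leader-elect-lease-duration, --leader-elect-renew-deadline, and --leader-elect-retry-period flags:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Assumes the process runs in-cluster with RBAC to manage Leases.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The Lease object all candidates compete for ("my-controller" is a
	// hypothetical name; kube-controller-manager uses its own lock).
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("HOSTNAME"), // must be unique per candidate
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid before others may claim it
		RenewDeadline: 10 * time.Second, // how long the leader keeps retrying renewal before giving up
		RetryPeriod:   2 * time.Second,  // wait between acquire/renew attempts
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the controller's work loops here.
			},
			OnStoppedLeading: func() {
				// Renewal failed past RenewDeadline: exit so that only
				// one instance is ever active.
				os.Exit(1)
			},
		},
	})
}
```

Note how OnStoppedLeading exits the process; that is the same by-design behavior behind the fatal "leaderelection lost" line in the logs above.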
Another approach you can consider is leveraging API Priority & Fairness to increase the chances of the controller's API calls succeeding (if the controller is not itself the source of the API overload).

**2) Network issue:** if it is a network issue (a failing leader election is a symptom of a problem with the host network, not its cause):

Check whether the issue resolves after restarting the SDN pod.

"sdn-controller""sdn"是非常不同的东西。如果重新启动 sdn pod可以修复问题,那么您注意到的 sdn-controller 错误并不是实际问题。
