kubernetes [bug] StorageError: invalid object

Posted by sdnqo3pr 6 months ago in Kubernetes

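Copied from kubernetes-sigs/controller-runtime#1881, where this issue was originally reported, because it is unlikely to be related to controller-runtime. Unfortunately, the prow transfer plugin does not support cross-organization transfers :(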
/kind bug
/sig api-machinery
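/cc @stijndehaes @pier-oliviert
Hi all,
When we run our operator in production we see a lot of error logs, and we have not been able to figure out what is going on. The basic flow of what we do is as follows: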

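func (r *SparkReconciler) createSparkSubmitterPod(ctx context.Context, spark *runtimev1.Spark) error {
	// Look the submitter pod up through the (cached) client.
	var pod corev1.Pod
	err := r.Get(ctx, client.ObjectKey{
		Name:      submitterPodName(spark),
		Namespace: spark.Namespace,
	}, &pod)
	// Only create the pod when the lookup reports that it does not exist.
	if apierrors.IsNotFound(err) {
		r.Log.V(1).Info("Creating the spark submitter pod")
		return r.doCreateSparkSubmitterPod(ctx, spark)
	}
	// Assuming errors is github.com/pkg/errors, Wrap returns nil for a nil
	// err, so an existing pod results in a no-op here.
	return errors.Wrap(err, "Failed getting the pod")
}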

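As you can see in the snippet, we only create a pod when it does not exist. Because we read from the cache, we know the Get call can return NotFound even when the pod exists. In doCreateSparkSubmitterPod we do the following: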

if err := r.Client.Create(ctx, pod); err != nil {
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return errors.Wrap(err, "Failed creating the pod")
}
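The interaction between a lagging cache and the create call can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `fakeCluster`, `ensurePod`, and the two sentinel errors are hypothetical stand-ins, not controller-runtime or client-go APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the API server's NotFound and
// AlreadyExists responses.
var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// fakeCluster models an API server plus a read cache that may lag behind it.
type fakeCluster struct {
	server map[string]bool // authoritative state
	cache  map[string]bool // possibly stale view used for reads
}

// get reads from the cache only, like a cached client would.
func (c *fakeCluster) get(name string) error {
	if !c.cache[name] {
		return errNotFound
	}
	return nil
}

// create writes to the server and fails if the object is already there.
func (c *fakeCluster) create(name string) error {
	if c.server[name] {
		return errAlreadyExists
	}
	c.server[name] = true
	return nil
}

// ensurePod mirrors the flow described in the issue: create only on NotFound,
// and treat AlreadyExists as success because the cache may simply be behind.
func ensurePod(c *fakeCluster, name string) error {
	err := c.get(name)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errNotFound) {
		return fmt.Errorf("failed getting the pod: %w", err)
	}
	if err := c.create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("failed creating the pod: %w", err)
	}
	return nil
}

func main() {
	// The pod exists on the server, but the cache has not observed it yet.
	c := &fakeCluster{
		server: map[string]bool{"submitter": true},
		cache:  map[string]bool{},
	}
	fmt.Println(ensurePod(c, "submitter")) // prints "<nil>"
}
```

In this model the stale-cache case is absorbed by the AlreadyExists check, which matches the intent of the snippets above. It does not reproduce the StorageError reported in this issue.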

So we handle the AlreadyExists error. However, in production we still frequently see many errors like this:

"error":"Operation cannot be fulfilled on pods \"255f70af-5699-46c6-8002-a4df45af5209\": StorageError: invalid object, Code: 4, Key: /registry/pods/addatatest2/255f70af-5699-46c6-8002-a4df45af5209, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: f8104e60-1a9d-40ca-847e-52bc0f556844, UID in object meta: "

We have no idea what causes this. I think it may happen when the pod already exists, but I am not entirely sure. Can someone explain?

i86rm4rw #1

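Hey, I can reproduce this in my test environment.
We use controller-runtime's envtest for our test setup, together with a custom rate limiter that applies a constant delay. When the delay is below 20ms, I can reliably reproduce the problem while updating a custom resource. Increasing the delay resolved it for me.
When it happens, the custom resource appears to vanish even though I never explicitly deleted it.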
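For readers unfamiliar with the idea, a constant-delay rate limiter could look roughly like the sketch below. It is not the commenter's code: `constantDelayLimiter` is a hypothetical type, and its method names (`When`, `Forget`, `NumRequeues`) are only modeled on the shape of client-go's workqueue rate limiter, which is an assumption here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// constantDelayLimiter returns the same delay for every item, regardless of
// how often that item has been requeued. Hypothetical type for illustration.
type constantDelayLimiter struct {
	mu       sync.Mutex
	delay    time.Duration
	requeues map[string]int
}

func newConstantDelayLimiter(delay time.Duration) *constantDelayLimiter {
	return &constantDelayLimiter{delay: delay, requeues: map[string]int{}}
}

// When records a requeue for the item and returns the fixed delay.
func (l *constantDelayLimiter) When(item string) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.requeues[item]++
	return l.delay
}

// Forget clears the requeue count for the item.
func (l *constantDelayLimiter) Forget(item string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.requeues, item)
}

// NumRequeues reports how many times When was called for the item since the
// last Forget.
func (l *constantDelayLimiter) NumRequeues(item string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.requeues[item]
}

func main() {
	// 20ms is the threshold mentioned in the comment above.
	l := newConstantDelayLimiter(20 * time.Millisecond)
	fmt.Println(l.When("spark-a"), l.NumRequeues("spark-a")) // prints "20ms 1"
}
```

With such a limiter, lowering `delay` shortens the gap between consecutive reconciles of the same item, which is the knob the comment describes changing.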

xienkqul #2

Is there any update? This is not limited to pods; StatefulSets (sts) have the same problem.

eqzww0vc #3

/help
/triage accepted

bbuxkriu #4

@fedebongio:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
/triage accepted
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

vuv7lop3 #5

Our organization has also run into this in envtest. I have not dug into it yet, but I find the UID precondition part interesting:

Precondition failed: UID in precondition: f8104e60-1a9d-40ca-847e-52bc0f556844, UID in object meta: "

It looks like it comes from one of these two locations: https://github.com/kubernetes/kubernetes/blob/v1.30.0/staging/src/k8s.io/apiserver/pkg/storage/interfaces.go#L141-L144 and https://github.com/kubernetes/apiserver/blob/v0.30.0/pkg/storage/interfaces.go#L141-L144
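To make the shape of that message easier to reason about, here is a simplified standalone model of a UID precondition check. It is a sketch, not the apiserver code: `preconditions`, `object`, and `check` are hypothetical names, and only the UID comparison is modeled. Treating a missing object as one with an empty UID is an assumption made for illustration, not something confirmed in this thread.

```go
package main

import "fmt"

// preconditions is a simplified stand-in for storage-layer preconditions;
// only the UID field is modeled.
type preconditions struct {
	UID *string
}

// object is a hypothetical minimal stored object carrying only a UID.
type object struct {
	UID string
}

// check returns an error when a UID precondition is set and differs from the
// UID of the object found at key. A missing object is modeled as the zero
// value, so its UID is the empty string (assumption).
func (p *preconditions) check(key string, obj object) error {
	if p == nil || p.UID == nil {
		return nil
	}
	if *p.UID != obj.UID {
		return fmt.Errorf(
			"invalid object, Key: %s, Precondition failed: UID in precondition: %v, UID in object meta: %v",
			key, *p.UID, obj.UID)
	}
	return nil
}

func main() {
	uid := "f8104e60-1a9d-40ca-847e-52bc0f556844"
	p := &preconditions{UID: &uid}

	// The caller expects a specific UID, but the object at the key has none.
	err := p.check("/registry/pods/addatatest2/255f70af-5699-46c6-8002-a4df45af5209", object{})
	fmt.Println(err)
}
```

Under this model, an empty "UID in object meta" means the expected UID was compared against an object with no UID set. That would be consistent with the object no longer being present at that key when the request was processed, though the thread does not confirm this.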
