GCP alerting policy for failed GKE CronJob

Posted by x6yk4ghg on 2023-04-29 in Kubernetes

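What would be the best way to set up a GCP monitoring alert policy for a failing Kubernetes CronJob? I haven't been able to find any good examples.
Right now, I have an OK solution based on monitoring the Pod's logs for ERROR severity. I've found this to be quite flaky, however. Sometimes a job fails for an ephemeral reason outside my control (for example, an external server returns a temporary 500), and on the next retry the job runs successfully.
What I really need is an alert that is triggered only when a CronJob is in a persistent failed state. That is, Kubernetes has tried rerunning the whole thing multiple times and it is still failing. Ideally, it would also handle situations where the Pod could not come up at all (for example, downloading the image failed).
Any ideas?
Thanks.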

kubernetes

Source: https://stackoverflow.com/questions/71485483/gcp-alerting-policy-for-failed-gke-cronjob

2 Answers

9o685dep1#

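First, confirm which GKE version you are running. The following commands help you identify the default and available GKE versions:

Default version: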

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.defaultVersion)"

Available versions:

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.validVersions)"

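Now that you know your GKE version, and given that you want an alert that fires only when a CronJob is in a persistent failed state: GKE Workload Metrics was GCP's fully managed and highly configurable solution for sending all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or an application's Deployment) to Cloud Monitoring. However, it is deprecated as of GKE 1.24 and has been replaced by Google Cloud Managed Service for Prometheus, so the latter is the best option you have inside GCP. It lets you monitor and alert on your workloads using Prometheus without having to manually manage and operate Prometheus at scale.
In addition, you have two options outside of GCP: Prometheus itself, and Ranch's Prometheus Push Gateway.
Finally, and just FYI, it can be done manually by querying the job, checking its start time, and comparing that to the current time, like this, with bash: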

# -r prints the raw timestamp without the surrounding JSON quotes
START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq -r '.status.startTime')
echo "$START_TIME"

Or, you can get the current status of the job as a JSON blob, as follows:

kubectl -n=your-namespace get job your-job-name -o json | jq '.status'
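
The snippet above only prints the start time; the comparison with the current time could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: the function name `job_age_seconds` and the `NOW_EPOCH` override are hypothetical, and it assumes GNU `date` (for `date -d`) and a kubectl context that can read the Job.

```shell
# Print how many seconds ago a Job started.
# Assumptions: GNU date is available; job_age_seconds and NOW_EPOCH are
# illustrative names, not part of kubectl or GKE.
job_age_seconds() {
  local namespace="$1" job="$2" start_time start_epoch now_epoch
  # jsonpath prints the raw RFC 3339 timestamp, e.g. 2023-04-19T12:05:53Z
  start_time=$(kubectl -n "$namespace" get job "$job" -o jsonpath='{.status.startTime}') || return 1
  # An empty value means the Job has not started yet
  [ -n "$start_time" ] || return 1
  start_epoch=$(date -u -d "$start_time" +%s) || return 1
  # NOW_EPOCH lets a caller pin "now"; otherwise use the current time
  now_epoch=${NOW_EPOCH:-$(date -u +%s)}
  echo $(( now_epoch - start_epoch ))
}
```

For example, `job_age_seconds your-namespace your-job-name` prints the Job's age in seconds, which a wrapper script could compare against a threshold of its choosing.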

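You can also see the following thread for more reference.
Taking the "Failed" state as the core of your requirement, you can set up a bash script with kubectl that sends an email when it sees a job in the "Failed" state. Here are some examples: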

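while true; do
  # Test grep's exit status directly; backticks would try to execute its output
  if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then
    echo "Job myjob failed" | mail -s jobfailed email@address; break
  else sleep 1; fi
done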

For newer Kubernetes versions:

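# kubectl wait exits non-zero on timeout, so only send mail when the condition was met
while true; do kubectl wait --for=condition=failed job/myjob && { echo "Job myjob failed" | mail -s jobfailed email@address; break; }; done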
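
The loops above fire on the first Job that fails, while the question asks for persistent failure. A Job only gets the Failed condition after its own Pod retries (`backoffLimit`) are exhausted, so counting how many of a CronJob's most recent Jobs failed in a row is one way to approximate "still failing after several runs". The following is a rough sketch under assumptions: the function name `consecutive_failed_jobs` is hypothetical, and it only sees Jobs that still exist, so the CronJob's `failedJobsHistoryLimit` (default 1) would need to be raised for counts above 1.

```shell
# Count how many of a CronJob's most recent finished Jobs failed in a row.
# consecutive_failed_jobs is an illustrative name, not a kubectl feature.
consecutive_failed_jobs() {
  local namespace="$1" cronjob="$2"
  # One line per Job, oldest first: owner name, Failed status, Complete status
  kubectl -n "$namespace" get jobs --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[*]}{.metadata.ownerReferences[0].name}{"\t"}{.status.conditions[?(@.type=="Failed")].status}{"\t"}{.status.conditions[?(@.type=="Complete")].status}{"\n"}{end}' |
    awk -F'\t' -v cj="$cronjob" '
      $1 == cj && $2 == "True" { n++ }     # a failed Job extends the streak
      $1 == cj && $3 == "True" { n = 0 }   # a completed Job resets it
      END { print n + 0 }'                 # still-running Jobs are ignored
}
```

A cron entry or a small watcher script could then send the email only when, say, `consecutive_failed_jobs default myjob` prints 3 or more.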

8xiog9wr2#

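I was looking for the same solution to monitor GKE CronJobs and found this approach:
By using GCP's log-based alerting feature (see the Log Based Alert Doc), we are able to get notified whenever a Job of the CronJob is considered failed, using the following log query: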

resource.labels.cluster_name="CLUSTER_NAME"
resource.type="k8s_cluster"
jsonPayload.source.component="cronjob-controller"
jsonPayload.reason="SawCompletedJob"
"status: Failed"

A sample log entry looks like this:

{
  "insertId": "sb6oijf4yi39m",
  "jsonPayload": {
    "type": "Normal",
    "reportingComponent": "",
    "source": {
      "component": "cronjob-controller"
    },
    "lastTimestamp": "2023-04-19T12:05:53Z",
    "metadata": {
      "uid": "0efad02d-c441-4964-b048-496552ecc572",
      "namespace": "default",
      "managedFields": [
        {
          "apiVersion": "v1",
          "time": "2023-04-19T12:05:53Z",
          "manager": "kube-controller-manager",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "fieldsV1": {
            "f:count": {},
            "f:reason": {},
            "f:firstTimestamp": {},
            "f:type": {},
            "f:source": {
              "f:component": {}
            },
            "f:involvedObject": {},
            "f:lastTimestamp": {},
            "f:message": {}
          }
        }
      ],
      "resourceVersion": "47727088",
      "creationTimestamp": "2023-04-19T12:05:53Z",
      "name": "CRONJOB_NAME.1757548d9eb51a26"
    },
    "message": "Saw completed job: CRONJOB_NAME-28031760, status: Failed",
    "kind": "Event",
    "eventTime": null,
    "involvedObject": {
      "apiVersion": "batch/v1",
      "namespace": "default",
      "kind": "CronJob",
      "name": "CRONJOB_NAME",
      "uid": "6c43108b-14d6-11ea-ac1e-42010af00026",
      "resourceVersion": "1286547540"
    },
    "reportingInstance": "",
    "apiVersion": "v1",
    "reason": "SawCompletedJob"
  },
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "cluster_name": "REDACTED",
      "project_id": "REDACTED",
      "location": "asia-east1-a"
    }
  },
  "timestamp": "2023-04-19T12:05:53Z",
  "severity": "INFO",
  "logName": "projects/REDACTED/logs/events",
  "receiveTimestamp": "2023-04-19T12:05:58.075764494Z"
}
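
Before wiring the query into an alert policy, it can help to try it from the command line. The helpers below are a minimal sketch: `failed_cronjob_filter` and `failed_job_name` are hypothetical names, the filter is the query from this answer joined with explicit `AND`, and the message pattern is taken from the sample entry above.

```shell
# Build the log filter from this answer for a given cluster name.
# failed_cronjob_filter is an illustrative helper, not a gcloud feature.
failed_cronjob_filter() {
  printf 'resource.type="k8s_cluster" AND resource.labels.cluster_name="%s" AND jsonPayload.source.component="cronjob-controller" AND jsonPayload.reason="SawCompletedJob" AND "status: Failed"\n' "$1"
}

# Read event messages on stdin and print the name of each failed Job.
failed_job_name() {
  sed -n 's/^Saw completed job: \(.*\), status: Failed$/\1/p'
}

# Usage (needs an authenticated gcloud; my-cluster is a placeholder):
#   gcloud logging read "$(failed_cronjob_filter my-cluster)" --freshness=1d \
#     --format='value(jsonPayload.message)' | failed_job_name
```

Note that this event is emitted per failed Job, so on its own it does not distinguish a one-off failure from a persistent one.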
