I have a Pod deployment that keeps going into CrashLoopBackOff. I have set up an alert for this event, but the alert does not fire on the receiver I configured. It only fires on the default AlertManager receiver that is configured for the AlertManager deployment.
The AlertManager deployment is part of a bitnami/kube-prometheus stack deployment.
I have added a custom receiver that the alert should also be sent to. The receiver is essentially an email recipient, with the following definition:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    alertmanagerConfig: email
    release: prometheus
spec:
  route:
    receiver: 'email-receiver'
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 5m
    matchers:
      - name: job
        value: pod-restarts
  receivers:
    - name: 'email-receiver'
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          sendResolved: true
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
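To confirm the operator actually picked this object up, the resource and the generated secret it is merged into can be inspected. This is only a sketch: the secret name is derived from the Alertmanager object name visible in the pod name used later in this post, and the gzipped key is an assumption that depends on the operator version:

kubectl -n monitoring get alertmanagerconfig pod-restarts-receiver

# The operator merges every matching AlertmanagerConfig into a generated secret
kubectl -n monitoring get secret \
  alertmanager-prometheus-kube-prometheus-alertmanager-generated \
  -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip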
The alert is triggered by the following PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts-alert
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: api
      rules:
        - alert: PodRestartsAlert
          expr: sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5
          for: 1m
          labels:
            severity: critical
            job: pod-restarts
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has more than 5 restarts"
            description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has experienced more than 5 restarts."
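Before looking at routing, it is worth confirming that Prometheus has loaded this rule and that the alert is actually firing with the expected labels. A quick sketch, assuming the Prometheus service created for the release "prometheus" is named prometheus-kube-prometheus-prometheus (adjust if your chart names it differently):

kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090 &

# Is the rule loaded?
curl -s http://127.0.0.1:9090/api/v1/rules | grep -o 'PodRestartsAlert'

# Is the alert firing, and with which labels (job, namespace, severity)?
curl -s http://127.0.0.1:9090/api/v1/alerts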
I extracted the definition of the default receiver from inside the AlertManager pod as follows:
kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- sh
cd conf
cat config.yaml
config.yaml has the following definition:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
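That config.yaml is only the base configuration; the running, merged configuration can also be read from the Alertmanager API (the same content the UI's Status page shows). A sketch using a port-forward so no extra tools are needed inside the container:

kubectl -n monitoring port-forward alertmanager-prometheus-kube-prometheus-alertmanager-0 9093 &

# The merged configuration is returned under "config.original"
curl -s http://127.0.0.1:9093/api/v2/status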
I also pulled the deployment's merged global configuration from the AlertManager UI. As expected, it shows that the new receiver has been added:
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
  webex_api_url: https://webexapis.com/v1/messages
route:
  receiver: "null"
  group_by:
    - job
  continue: false
  routes:
    - receiver: monitoring/pod-restarts-receiver/email-receiver
      group_by:
        - alertname
      match:
        job: pod-restarts
      matchers:
        - namespace="monitoring"
      continue: true
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 5m
    - receiver: "null"
      match:
        alertname: Watchdog
      continue: false
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
receivers:
  - name: "null"
  - name: monitoring/pod-restarts-receiver/email-receiver
    email_configs:
      - send_resolved: true
        to: etshuma@mycompany.com
        from: ops@mycompany.com
        hello: localhost
        smarthost: mail2.mycompany.com:25
        headers:
          From: ops@mycompany.com
          Subject: '{{ template "email.default.subject" . }}'
          To: etshuma@mycompany.com
        html: '{{ template "email.default.html" . }}'
        require_tls: true
templates: []
EDIT
I have a few questions about the AlertManager global configuration:
1. Strangely, the default receiver in the global configuration is "null". Why?
2. The global section at the very top of the configuration has no mail (SMTP) settings. Could that be a problem?
3. I am not sure whether the mail settings defined at the AlertmanagerConfig level take effect, or even how to update the global configuration file (it is only accessible from inside the pod). I checked the values.yaml used to launch the deployment and it has no options for smarthost or any other mail settings.
4. The global configuration also contains a matcher namespace="monitoring". Do I need to add a matching namespace label to the PrometheusRule? (See the routing-test sketch after this list.)
5. Does this mean the AlertmanagerConfig must be in the same namespace as the PrometheusRule and the target pod?
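One way to check questions 4 and 5 locally is to run the merged configuration through amtool's route tester with the exact label set this PrometheusRule produces. A sketch, assuming the merged configuration shown above has been saved to merged.yaml and that amtool is available (it ships with Alertmanager):

amtool config routes test --config.file=merged.yaml \
  alertname=PodRestartsAlert job=pod-restarts namespace=labs severity=critical pod=crash-loop-pod

The output is the receiver this label set would be routed to; if it prints the null receiver rather than monitoring/pod-restarts-receiver/email-receiver, the alert is falling through to the default route.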
The AlertmanagerConfig also does not render anything in the routing tree editor at https://prometheus.io/webtools/alerting/routing-tree-editor/.
What exactly am I missing?
1 Answer
The problem was caused by a TLS verification failure, which I found after checking the AlertManager logs.
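A way to pull those errors, assuming the container inside the operator-managed pod is named alertmanager:

kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 \
  -c alertmanager | grep -iE 'smtp|tls|email'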
The AlertmanagerConfig needs to be updated with the requireTLS flag set to false.
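A sketch of the updated manifest, assuming the requireTLS field of emailConfigs is the right knob for a plain-SMTP smarthost on port 25 (only the last line differs from the original definition):

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    alertmanagerConfig: email
    release: prometheus
spec:
  route:
    receiver: 'email-receiver'
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 5m
    matchers:
      - name: job
        value: pod-restarts
  receivers:
    - name: 'email-receiver'
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          sendResolved: true
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          # disable STARTTLS enforcement for the plain-SMTP relay
          requireTLS: false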