Kubernetes AlertmanagerConfig not sending alerts to the email receiver

tzxcd3kk posted on 2023-08-03 in Kubernetes

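I have a deployment whose Pod keeps going into the CrashLoopBackOff state. I have set up an alert for this event, but the alert is not firing on the configured receiver. It only fires on the default Alertmanager receiver that is configured with every Alertmanager deployment.
The Alertmanager deployment is part of a bitnami/kube-prometheus stack deployment.
I have added a custom receiver to which the alert should also be sent. This receiver is essentially an email recipient, and it has the following definition: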

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    alertmanagerConfig: email
    release: prometheus
spec:
  route:
    receiver: 'email-receiver'
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 5m
    matchers:
      - name: job
        value: pod-restarts
  receivers:
  - name: 'email-receiver'
    emailConfigs:
      - to: 'etshuma@mycompany.com'
        sendResolved: true
        from: 'ops@mycompany.com'
        smarthost: 'mail2.mycompany.com:25'

The alert is triggered by the following PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts-alert
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: api
      rules:
        - alert: PodRestartsAlert 
          expr: sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5
          for: 1m
          labels:
            severity: critical
            job: pod-restarts
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has more than 5 restarts"
            description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has experienced more than 5 restarts."


I extracted the definition of the default receiver from inside the Alertmanager pod as follows:

kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- sh
cd conf
cat config.yaml


And config.yaml has the following definition:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']


I also extracted the deployment's global configuration from the Alertmanager UI. As expected, it shows that the new alert receiver has been added:

global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
  webex_api_url: https://webexapis.com/v1/messages
route:
  receiver: "null"
  group_by:
  - job
  continue: false
  routes:
  - receiver: monitoring/pod-restarts-receiver/email-receiver
    group_by:
    - alertname
    match:
      job: pod-restarts
    matchers:
    - namespace="monitoring"
    continue: true
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 5m
  - receiver: "null"
    match:
      alertname: Watchdog
    continue: false
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
- name: "null"
- name: monitoring/pod-restarts-receiver/email-receiver
  email_configs:
  - send_resolved: true
    to: etshuma@mycompany.com
    from: ops@mycompany.com
    hello: localhost
    smarthost: mail2.mycompany.com:25
    headers:
      From: ops@mycompany.com
      Subject: '{{ template "email.default.subject" . }}'
      To: etshuma@mycompany.com
    html: '{{ template "email.default.html" . }}'
    require_tls: true
templates: []

EDIT

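I have a few questions about the Alertmanager global configuration:
1. Oddly, in the global configuration my receiver is "null". (Why?)
2. The topmost section of the global configuration has no mail settings. (Could this be a problem?)
3. I am not sure whether the mail settings defined at the AlertmanagerConfig level take effect, or even how to update the global config file (it is only accessible from the pod). I have looked at the values.yaml file used to spin up the deployment, and it has no options for smarthost or any other mail settings.
4. The global config file contains an additional matcher, - namespace="monitoring". Do I need to add a similar namespace label in the PrometheusRule?
5. Does this mean the AlertmanagerConfig has to be in the same namespace as the PrometheusRule and the target pod?
The AlertmanagerConfig also does not render anything at https://prometheus.io/webtools/alerting/routing-tree-editor/
What exactly am I missing?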
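
On the question about the namespace="monitoring" matcher: the generated child route carries both the job condition and the namespace condition, and a route is only taken when all of its conditions hold. The following is a minimal Python sketch of that check, not Alertmanager's own code. It models equality matchers only, `route_matches` is a hypothetical helper name, and the alert label set is an assumption derived from the PrometheusRule as written (where `sum by (namespace, pod)` keeps the namespace value from the selector).

```python
def route_matches(alert_labels, required_labels):
    """Return True when every required label is on the alert with the same value."""
    return all(alert_labels.get(k) == v for k, v in required_labels.items())


# Conditions on the generated child route: the job matcher from the
# AlertmanagerConfig plus the namespace matcher shown in the generated config.
required = {"job": "pod-restarts", "namespace": "monitoring"}

# Assumed labels of the alert fired by the rule as written: the expression
# selects namespace="labs" and aggregates by (namespace, pod).
alert_from_rule = {
    "alertname": "PodRestartsAlert",
    "namespace": "labs",
    "pod": "crash-loop-pod",
    "severity": "critical",
    "job": "pod-restarts",
}

# Hypothetical variant whose namespace label equals the AlertmanagerConfig's namespace.
alert_in_monitoring = dict(alert_from_rule, namespace="monitoring")

print(route_matches(alert_from_rule, required))      # False
print(route_matches(alert_in_monitoring, required))  # True
```

Under these assumptions, the namespace label on the alert decides whether the child route is considered at all, independently of the mail settings.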

oprakyz7

The problem was caused by a TLS verification failure. After checking the logs, this is what I found:

kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 --since=10m
    STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:40.660Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="monitoring/pod-restarts-receiver/email/email[0]: notify retry canceled after 13 attempts: send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:40.707Z caller=notify.go:732 level=warn component=dispatcher receiver=monitoring/pod-restarts-receiver/email integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:41.380Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
    ts=2023-07-23T11:18:41.390Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
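
When the log is long, the failed notifications can be picked out mechanically. The sketch below is one possible way to do that in Python, assuming the log lines are in the key=value (logfmt) layout shown above; `parse_logfmt` and `failed_notifications` are hypothetical helper names, and the sample lines are copied from the output above.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line (key=value, values optionally double-quoted) into a dict."""
    fields = {}
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Truncated lines with an unbalanced quote are skipped.
        return fields
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def failed_notifications(lines):
    """Yield (timestamp, error) for entries logged at level=error that carry an err field."""
    for line in lines:
        fields = parse_logfmt(line.strip())
        if fields.get("level") == "error" and "err" in fields:
            yield fields.get("ts"), fields["err"]


sample = [
    'ts=2023-07-23T11:18:40.660Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="monitoring/pod-restarts-receiver/email/email[0]: notify retry canceled after 13 attempts: send STARTTLS command: x509: certificate signed by unknown authority"',
    'ts=2023-07-23T11:18:41.380Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml',
]

for ts, err in failed_notifications(sample):
    print(ts, err)
```

The same functions could be fed from the saved output of the `kubectl logs` command instead of the inline sample.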

The AlertmanagerConfig needs to be updated with the requireTLS flag set to false:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: email
    routes:
      - matchers:
        - name: job
          value: pod-restarts
        receiver: email
  receivers:
    - name: email
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
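
To check whether the certificate error can be reproduced outside Alertmanager, the relay can be probed directly. This is a rough sketch using Python's standard smtplib and ssl modules; `probe_starttls` and its `smtp_factory` parameter are hypothetical names introduced here, and the probe verifies against the local machine's default trust store, which may differ from the one inside the Alertmanager container.

```python
import smtplib
import ssl


def probe_starttls(host, port, smtp_factory=smtplib.SMTP, timeout=10):
    """Return None if STARTTLS succeeds with default CA verification, else the error text."""
    context = ssl.create_default_context()
    try:
        with smtp_factory(host, port, timeout=timeout) as client:
            client.ehlo()
            client.starttls(context=context)
    except (ssl.SSLError, smtplib.SMTPException, OSError) as exc:
        # Certificate problems, SMTP-level refusals, and connection errors all end up here.
        return f"{type(exc).__name__}: {exc}"
    return None
```

For example, `probe_starttls("mail2.mycompany.com", 25)` run from a host that can reach the relay returns either None or a description of the failure. The `smtp_factory` argument exists only so the function can be exercised without a network connection.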
