
AlertmanagerConfig not sending alerts to email receiver

Question


I have a deployment of a pod that continuously goes into CrashLoopBackOff state. I have set up an alert for this event, but the alert is not firing on the configured receiver. The alert only fires on the default AlertManager receiver that is configured with each AlertManager deployment.

The AlertManager deployment is part of a bitnami/kube-prometheus stack deployment.

I have added the custom receiver to which the alert should also be sent. This receiver is essentially an email recipient, and it has the following definition:

    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: pod-restarts-receiver
      namespace: monitoring
      labels:
        alertmanagerConfig: email
        release: prometheus
    spec:
      route:
        receiver: 'email-receiver'
        groupBy: ['alertname']
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 5m
        matchers:
          - name: job
            value: pod-restarts
      receivers:
        - name: 'email-receiver'
          emailConfigs:
            - to: 'etshuma@mycompany.com'
              sendResolved: true
              from: 'ops@mycompany.com'
              smarthost: 'mail2.mycompany.com:25'
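
A quick way to confirm the object was created and that the Alertmanager custom resource actually selects it; a minimal sketch, where the selector field names (alertmanagerConfigSelector, alertmanagerConfigNamespaceSelector) come from the prometheus-operator Alertmanager CRD:

    # Confirm the AlertmanagerConfig exists and carries the labels the operator matches on
    kubectl -n monitoring get alertmanagerconfig pod-restarts-receiver --show-labels

    # Show the label and namespace selectors the Alertmanager CR uses to pick up AlertmanagerConfigs
    kubectl -n monitoring get alertmanager -o jsonpath='{.items[*].spec.alertmanagerConfigSelector}{"\n"}{.items[*].spec.alertmanagerConfigNamespaceSelector}{"\n"}'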

This alert is triggered by the PrometheusRule below:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: pod-restarts-alert
      namespace: monitoring
      labels:
        app: kube-prometheus-stack
        release: prometheus
    spec:
      groups:
        - name: api
          rules:
            - alert: PodRestartsAlert
              expr: sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5
              for: 1m
              labels:
                severity: critical
                job: pod-restarts
              annotations:
                summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has more than 5 restarts"
                description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has experienced more than 5 restarts."
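
To rule out the rule itself, the object and its expression can be checked directly against Prometheus. A minimal sketch, assuming the Prometheus service is named prometheus-kube-prometheus-prometheus (a guess based on the release name; adjust to the actual service):

    # Confirm the PrometheusRule object exists and carries the labels the operator's rule selector expects
    kubectl -n monitoring get prometheusrule pod-restarts-alert --show-labels

    # Port-forward Prometheus and evaluate the alert expression via the HTTP API
    kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
    curl -G http://localhost:9090/api/v1/query \
      --data-urlencode 'query=sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5'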

I have extracted the definition of the default receiver from the AlertManager pod as follows:

    kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- sh
    cd conf
    cat config.yaml

And config.yaml has the following definition:

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'web.hook'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
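
Note that this config.yaml does not appear to be what the running Alertmanager actually loads; the Alertmanager logs further down point at an operator-rendered file, which can be read directly from the pod:

    # The configuration Alertmanager actually loads (path taken from the Alertmanager logs below)
    kubectl -n monitoring exec alertmanager-prometheus-kube-prometheus-alertmanager-0 -- \
      cat /etc/alertmanager/config_out/alertmanager.env.yaml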

I have also extracted the global configuration of the deployment from the AlertManager UI. As expected, it shows that the new alert receiver has been added:

    global:
      resolve_timeout: 5m
      http_config:
        follow_redirects: true
        enable_http2: true
      smtp_hello: localhost
      smtp_require_tls: true
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      opsgenie_api_url: https://api.opsgenie.com/
      wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
      victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
      telegram_api_url: https://api.telegram.org
      webex_api_url: https://webexapis.com/v1/messages
    route:
      receiver: "null"
      group_by:
        - job
      continue: false
      routes:
        - receiver: monitoring/pod-restarts-receiver/email-receiver
          group_by:
            - alertname
          match:
            job: pod-restarts
          matchers:
            - namespace="monitoring"
          continue: true
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 5m
        - receiver: "null"
          match:
            alertname: Watchdog
          continue: false
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 12h
    receivers:
      - name: "null"
      - name: monitoring/pod-restarts-receiver/email-receiver
        email_configs:
          - send_resolved: true
            to: etshuma@mycompany.com
            from: ops@mycompany.com
            hello: localhost
            smarthost: mail2.mycompany.com:25
            headers:
              From: ops@mycompany.com
              Subject: '{{ template "email.default.subject" . }}'
              To: etshuma@mycompany.com
            html: '{{ template "email.default.html" . }}'
            require_tls: true
    templates: []
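
As a routing sanity check, amtool can report which receiver a given label set would be routed to. A minimal sketch against the rendered config file referenced in the logs, assuming amtool is available on the PATH inside the Alertmanager pod:

    # Test which receiver an alert with these labels would reach
    kubectl -n monitoring exec alertmanager-prometheus-kube-prometheus-alertmanager-0 -- \
      amtool config routes test \
        --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml \
        --verify.receivers=monitoring/pod-restarts-receiver/email-receiver \
        job=pod-restarts namespace=monitoring

Running the test with and without the namespace=monitoring label shows whether the extra namespace matcher is what keeps the alert on the default "null" receiver.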

EDIT

I have a number of questions about the global config for AlertManager:

  1. Strangely enough, in the global configuration my receivers are "null". (Why?)
  2. The topmost section of the global config doesn't have any mail settings (could this be an issue?).
  3. I'm not sure whether the mail settings defined at the AlertmanagerConfig level work, or even how to update the global config file (it's accessible from the pod only). I have looked at the values.yaml file used to spin up the deployment, and it doesn't have any options for smarthost or any other mail settings.
  4. There is an additional matcher in the global config file, namespace="monitoring". Do I need to add a similar namespace label in the PrometheusRule?
  5. Does this mean the AlertmanagerConfig has to be in the same namespace as the PrometheusRule and the target pod?

The AlertmanagerConfig is also not visualizing anything at https://prometheus.io/webtools/alerting/routing-tree-editor/.

What exactly am I missing?

Answer 1

Score: 0


The issue was caused by a TLS verification failure. After checking the logs, this is what I found:

    kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 --since=10m

    ts=2023-07-23T11:18:40.660Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="monitoring/pod-restarts-receiver/email/email[0]: notify retry canceled after 13 attempts: send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:40.707Z caller=notify.go:732 level=warn component=dispatcher receiver=monitoring/pod-restarts-receiver/email integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:41.380Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
    ts=2023-07-23T11:18:41.390Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
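
To confirm the failure really comes from the relay's certificate, the STARTTLS handshake can be inspected directly; a minimal sketch, using the mail host from the config above and assuming openssl is available wherever the command runs:

    # Show the certificate chain the SMTP server presents on STARTTLS
    openssl s_client -starttls smtp -connect mail2.mycompany.com:25 -showcerts </dev/null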

The AlertmanagerConfig needs to be updated with the requireTLS flag set to false:

    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: pod-restarts-receiver
      namespace: monitoring
      labels:
        release: prometheus
    spec:
      route:
        groupBy: ['alertname']
        groupWait: 30s
        groupInterval: 2m
        repeatInterval: 2m
        receiver: email
        routes:
          - matchers:
              - name: job
                value: pod-restarts
            receiver: email
      receivers:
        - name: email
          emailConfigs:
            - to: 'etshuma@mycompany.com'
              from: 'ops@mycompany.com'
              smarthost: 'mail2.mycompany.com:25'
              requireTLS: false
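
A minimal sketch of applying and re-checking the change, assuming the manifest above is saved as pod-restarts-receiver.yaml:

    # Apply the updated AlertmanagerConfig
    kubectl -n monitoring apply -f pod-restarts-receiver.yaml

    # Once the config-reloader has picked it up, watch for email notification attempts
    kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 --since=5m | grep -i email

Disabling TLS sends mail in the clear; if the relay's CA certificate is available, supplying it through the emailConfigs tlsConfig block of the AlertmanagerConfig CRD may be the safer alternative.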
