How do you limit a Prometheus alerting rule to a Kubernetes Service?

Question

I am currently using this Prometheus alerting rule, which works fine, but is too general:

sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100 > 50

I would like to change it in two ways:

  1. Make the 'container_cpu_usage_seconds_total{id="/"}[1m]' part specific to one Kubernetes Service whose pods execute a calculation.

  2. Divide the value from point 1 by the sum of the CPU cores required by the calculation pods. Right now this is 500 millicores.

How do I do this?

I found this thread, in which someone uses the following rule, but I am not quite sure how to adapt it to fit my criteria:

sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)

Answer 1

Score: 0

This is how I solved my problem:

((sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))) / ((avg(container_spec_cpu_quota{container="test-app"})/100000)*count(container_spec_cpu_quota{container="test-app"}))) * 100 > 50

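The first part is the number of cores that the containers named "test-app" are using. This is then divided by the number of cores that were assigned to them on creation.

The division by 100000 is necessary to compare the two: container_spec_cpu_quota is reported in microseconds of CPU time per scheduling period, so dividing by the period (100000 microseconds, assuming the default) converts it to cores. If the final value is bigger than 50, i.e. if the pods use more than 50% of their assigned CPU resource, an alert is registered.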
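The hard-coded 100000 in the expression only holds when the scheduling period has its default value. As a hedged variant, assuming cAdvisor also exposes container_spec_cpu_period with the same labels as container_spec_cpu_quota (this metric is not used in the original answer), the assigned cores can be derived from the ratio of the two metrics instead:

```promql
# Sketch: CPU usage of all "test-app" containers as a percentage of their combined CPU limit.
# quota / period yields the limit in cores for each container, so no constant is needed.
sum(rate(container_cpu_usage_seconds_total{container="test-app"}[1m]))
/
sum(
  container_spec_cpu_quota{container="test-app"}
  / container_spec_cpu_period{container="test-app"}
)
* 100 > 50
```

The threshold and the "test-app" selector are taken from the answer; only the denominator differs.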

Explanation of the different parts of the formula:

This term computes the total CPU usage, in cores, of the "test-app" containers.

sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))

This term represents the CPU assigned to each container, in cores.

avg(container_spec_cpu_quota{container="test-app"})/100000

This term is the number of test-app containers present.

count(container_spec_cpu_quota{container="test-app"})
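
Because the average of a set of series multiplied by their count equals their sum, the last two terms can be folded into a single aggregation. A minimal equivalent sketch, keeping the answer's label selector and constant:

```promql
# avg(x) * count(x) == sum(x), so the denominator becomes one sum over the quotas.
sum(rate(container_cpu_usage_seconds_total{container="test-app"}[1m]))
/
(sum(container_spec_cpu_quota{container="test-app"}) / 100000)
* 100 > 50
```

This form should fire under the same conditions as the longer expression while being easier to read in a rule file.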

huangapple
  • Published on June 15, 2023 at 19:41:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76482109.html