June 15, 2023, 19:41:22

How do you limit a Prometheus alerting rule to a Kubernetes Service?

Question

I am currently using this Prometheus alerting rule, which works fine, but is too general:

sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100 > 50

I would like to change it in two ways:

1. Make the container_cpu_usage_seconds_total{id="/"}[1m] part specific to one Kubernetes Service whose pods execute a calculation.
2. Divide the value from point 1 by the sum of the CPU cores required by the calculation pods. Right now this is 500 millicores.

How do I do this?

I found this thread, in which someone uses the following rule, but I am not quite sure how to adapt it to fit my criteria:

sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)

Answer 1

Score: 0

This is how I solved my problem:

((sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))) / ((avg(container_spec_cpu_quota{container="test-app"})/100000)*count(container_spec_cpu_quota{container="test-app"}))) * 100 > 50

The first part is the number of cores that the containers named "test-app" are using. This is then divided by the number of cores that were assigned to them on creation.

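The division by 100000 converts the quota into cores so that the two values can be compared: container_spec_cpu_quota is expressed in microseconds of CPU time per scheduling period, and the divisor assumes the default period of 100000 microseconds. If the final value is greater than 50, i.e. if the pods use more than 50% of their assigned CPU, the alert fires.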
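
To sanity-check the conversion, the per-container limit in cores can be queried directly. This is a sketch that assumes cAdvisor also exposes container_spec_cpu_period for the same containers and that the pods have a CPU limit set (without a limit there may be no quota series to divide):

```promql
# CPU limit of each "test-app" container, in cores.
# Both values are in microseconds, so the ratio is a plain number of cores.
container_spec_cpu_quota{container="test-app"}
  / container_spec_cpu_period{container="test-app"}
```

For a container limited to 500 millicores with the default period, the quota would be 50000, and 50000 / 100000 = 0.5 cores.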

Explanation of the different parts of the formula:

This term is the total CPU usage, in cores, of the "test-app" containers:

sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))

This term is the CPU assigned to a single container, in cores:

avg(container_spec_cpu_quota{container="test-app"})/100000

This term is the number of test-app containers present:

count(container_spec_cpu_quota{container="test-app"})
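
Since avg(...) * count(...) is just the sum of the per-container values, the denominator could also be written as a single sum. The variant below is an untested sketch that additionally replaces the hard-coded 100000 with container_spec_cpu_period, assuming that metric is scraped alongside the quota and carries the same labels:

```promql
(
  # Cores currently used by all "test-app" containers.
  sum(rate(container_cpu_usage_seconds_total{container="test-app"}[1m]))
  /
  # Cores assigned to all "test-app" containers (quota / period per container).
  sum(
    container_spec_cpu_quota{container="test-app"}
    / container_spec_cpu_period{container="test-app"}
  )
) * 100 > 50
```

To narrow the rule further to the pods behind one Service, extra label matchers (for example a namespace label, if your setup exposes one) can be added to each selector.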
