CPU Load average rule for 5 minutes
Question
We are using Prometheus with Grafana, and we want to set up an alert on the 5-minute CPU load average.
We have 60 servers with different numbers of CPU cores: some machines have 1 core, others 2, 6, 8, and so on.
The rule below alerts on the 5-minute load, but it does not distinguish between single-core and multi-core machines.
- name: alerting_rules
  rules:
  - alert: LoadAverage15m
    expr: node_load5 >= 0.75
    labels:
      severity: major
    annotations:
      summary: "Instance {{ $labels.instance }} - high load average"
      description: "{{ $labels.instance }} (measured by {{ $labels.job }}) has high load average ({{ $value }}) over 5 minutes."
I have also tried the rule below, but it does not work either:
  - alert: LoadAverage5minutes
    expr: node_load5/count(node_cpu{mode="idle"}) without (cpu,mode) >= 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
      description: "Load is high \n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Can you please tell me what changes my rule needs in order to work?
Thanks.
Answer 1
Score: 7
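The following expression should work. It uses node_cpu_seconds_total, the name under which node_exporter 0.16 and later expose per-CPU time (older versions used node_cpu), and counts the idle-mode series per instance to get the number of cores: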
expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
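For context, that expression could be dropped into a rule file roughly as follows. This is only a sketch, assuming node_exporter 0.16 or later; the group name, alert name, `for` duration, labels, and annotations are carried over from the question and are not prescribed by the answer.

```yaml
groups:
  - name: alerting_rules
    rules:
      - alert: LoadAverage5minutes
        # 5-minute load average divided by the number of CPU cores.
        # Each core exposes exactly one mode="idle" series, so counting
        # those series per (instance, job) yields the core count.
        expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
          description: "Load per core is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

If the targets carry labels beyond instance and job, the two sides of the division may not match one-to-one; writing the division as `/ on (instance, job)` is one way to handle that.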
Answer 2
Score: 1
The following query alerts when the average CPU usage over the last 5 minutes exceeds 95% on a particular instance:
avg(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95
Some applications cannot scale across multiple CPU cores. The query above will not notice such applications if the instance has more than one CPU core. For example, if an application can use only a single CPU core and runs on an instance with two CPU cores, the query above will not trigger, because that application alone cannot push the average CPU usage above 50%. For such cases, the following alerting query is recommended:
max(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95
This query alerts when at least one CPU core on a particular instance has been more than 95% busy over the last 5 minutes.
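The two queries above could be wrapped as alerting rules along these lines. This is a minimal sketch: the group name, alert names, `for` duration, severity, and annotation text are hypothetical placeholders, not part of the answer.

```yaml
groups:
  - name: cpu_usage_rules              # hypothetical group name
    rules:
      - alert: HighAverageCpuUsage     # hypothetical alert name
        # Average non-idle CPU fraction across all cores of an instance.
        expr: |
          avg(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m                        # assumed; tune to taste
        labels:
          severity: warning
        annotations:
          summary: "Average CPU usage above 95% (instance {{ $labels.instance }})"
      - alert: SingleCoreSaturated     # hypothetical alert name
        # Busiest single core of an instance; catches single-threaded load.
        expr: |
          max(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "At least one CPU core above 95% (instance {{ $labels.instance }})"
```

Keeping both rules lets the first catch overall saturation while the second flags a single pegged core on a multi-core machine.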