2020-01-06 20:23:05

CPU Load average rule for 5 minutes

Question

We are using Prometheus with Grafana and want to set up an alert on the 5-minute CPU load average.

We have 60 servers with different numbers of CPU cores: some machines have 1 core, others have 2, 6, 8, and so on.

The rule below alerts on the 5-minute load, but it does not distinguish between single-core and multi-core machines.

- name: alerting_rules
  rules:
    - alert: LoadAverage15m
      expr: node_load5 >= 0.75
      labels:
        severity: major
      annotations:
        summary: "Instance {{ $labels.instance }} - high load average"
        description: "{{ $labels.instance  }} (measured by {{ $labels.job }}) has high load average ({{ $value }}) over 5 minutes."

I have also tried the rule below, but it does not work either:

- alert: LoadAverage5minutes
  expr: node_load5/count(node_cpu{mode="idle"}) without (cpu,mode) >= 0.95
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
    description: "Load is high \n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Can you please help me understand what changes my rule needs in order to work?

Thanks.

Answer 1

Score: 7

The following expression should work:

expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
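
For context, the expression above could be placed in a complete rule file roughly as follows. This is a sketch under a few assumptions: the group name, alert name, `for` duration, and severity are taken from the question, the annotation wording is illustrative, and the node_exporter version is assumed to be 0.16 or later, where the CPU metric is named `node_cpu_seconds_total` rather than `node_cpu`. Counting the `mode="idle"` series per `instance` and `job` yields the number of CPU cores, so the division gives the load per core with the same labels as `node_load5`.

```yaml
groups:
  - name: alerting_rules          # group name taken from the question
    rules:
      - alert: LoadAverage5minutes
        # 5-minute load average divided by the number of CPU cores on each instance.
        # Assumes node_exporter >= 0.16 (metric name node_cpu_seconds_total).
        expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
        for: 5m                   # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Load average per core is high (instance {{ $labels.instance }})"
          description: "Load per core is {{ $value }} on {{ $labels.instance }}"
```

With this rule, a load of 0.95 on a 1-core machine and a load of 7.6 on an 8-core machine both sit at the same threshold.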

Answer 2

Score: 1

The following query alerts when the average CPU usage for the last 5 minutes exceeds 95% on a particular instance:

avg(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95

Some applications cannot scale across multiple CPU cores. The query above will not notice such applications if the instance has more than one CPU core. For example, if an application can use only a single CPU core and runs on an instance with two CPU cores, the query above will not trigger, since the average CPU usage does not exceed 50%. For such cases the following alerting query is recommended:

max(
  sum(
    rate(node_cpu_seconds_total{mode!="idle"}[5m])
  ) without (mode)
) without (cpu) > 0.95

This query alerts when at least one CPU core on a particular instance has been more than 95% busy over the last 5 minutes.
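
The two queries above could be wrapped into alerting rules along these lines. This is only a sketch: the group name and both alert names are hypothetical, the `for` duration and severity are borrowed from the question, and the annotation text is illustrative. Note that these rules watch CPU usage rather than the load average the question asks about.

```yaml
groups:
  - name: cpu_usage_rules              # hypothetical group name
    rules:
      - alert: HighAverageCpuUsage     # hypothetical alert name
        # Average non-idle CPU time across all cores of an instance.
        expr: |
          avg(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average CPU usage is high (instance {{ $labels.instance }})"
          description: "Average CPU usage over 5 minutes is {{ $value }}"

      - alert: HighSingleCoreCpuUsage  # hypothetical alert name
        # Busiest single core of an instance; catches single-threaded workloads.
        expr: |
          max(
            sum(
              rate(node_cpu_seconds_total{mode!="idle"}[5m])
            ) without (mode)
          ) without (cpu) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "One CPU core is saturated (instance {{ $labels.instance }})"
          description: "Busiest core usage over 5 minutes is {{ $value }}"
```

Keeping the two alerts separate lets the single-core alert be routed or silenced independently on hosts that are expected to run single-threaded workloads.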
