Prometheus - Percentage of gauge values below a certain threshold
Question
I'm using the blackbox exporter to gather metrics from various endpoints, and I want to set up an SLI that determines the number of GET requests slower than 300ms and 1s per service.
The exporter provides a gauge metric called probe_duration_seconds.
I'm trying to write a PromQL query that calculates the percentage of probe_duration_seconds samples below 300ms over the last 5 hours.
My current query probe_duration_seconds{}[5h] < 0.3
returns an error:
> Error executing query: invalid parameter "query": 1:1: parse error:
> binary expression must contain only scalar and instant vector types.
I have also tried:
100 - sum(rate(probe_success{}[5h]) * 100) by (instance)
which gives me the overall success/failure rate, but I want to quantify it based on response time as well.
Answer 1
Score: 1
Prometheus doesn't provide a function that returns the percentage of raw samples with values below a given threshold over a given lookbehind window. This can be emulated with the subquery feature. For example, the following query returns the share of probe_duration_seconds samples with values smaller than 0.3 during the last 5 hours, as a fraction between 0 and 1 (multiply by 100 to get a percentage):
count_over_time((probe_duration_seconds < 0.3)[5h:1m])
/
count_over_time((probe_duration_seconds)[5h:1m])
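This query expects Prometheus to collect raw samples every minute (see the 1m after the colon in square brackets). Set it to your real scrape interval for more accurate results. Note that the query returns no result for a series when none of its samples in the window are below the threshold, because the numerator is then empty.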
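Since the question asks about two thresholds (300ms and 1s) and the step has to track the scrape interval, it may help to generate the query text rather than edit it by hand. The following is a minimal Python sketch of that idea; the helper name share_below_query and its parameters are hypothetical and only assemble the same PromQL string shown above.

```python
def share_below_query(metric: str, threshold: float, window: str, step: str) -> str:
    """Build the subquery-based ratio for one threshold.

    `step` should match the real scrape interval of `metric`
    (assumed to be known by the caller; it is not discovered here).
    """
    below = f"count_over_time(({metric} < {threshold})[{window}:{step}])"
    total = f"count_over_time(({metric})[{window}:{step}])"
    return f"{below} / {total}"


# One query per SLI threshold from the question: 300ms and 1s.
for threshold in (0.3, 1):
    print(share_below_query("probe_duration_seconds", threshold, "5h", "1m"))
```

Subtracting the result from 1 gives the share of probes that were slower than the threshold, which is the quantity the question is ultimately after.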
P.S. VictoriaMetrics (an alternative Prometheus-like solution I work on) provides the share_le_over_time() function, which can be used instead of the query above:
share_le_over_time(probe_duration_seconds[5h], 0.3)
This approach has the following advantages over the subquery-based approach:
- It is easier to write and maintain.
- It works with any scrape_interval between raw samples, so there is no need to adjust the query for different scrape intervals.
- It runs faster than the subquery-based approach and consumes less memory during execution, since the subquery may allocate large amounts of memory for small scrape intervals and large lookbehind windows.
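To make clear what is being computed over the raw samples, here is a small Python sketch of the same calculation on an in-memory list of values. It is only an illustration: the function name share_le and the sample durations are made up, and it assumes "le" means "less than or equal to the threshold", which differs slightly from the strict < used in the subquery.

```python
def share_le(values, threshold):
    """Return the fraction of values that are <= threshold.

    Returns None for an empty list, since no share can be computed.
    """
    if not values:
        return None
    matching = sum(1 for v in values if v <= threshold)
    return matching / len(values)


# Hypothetical probe durations in seconds, one per scrape.
durations = [0.12, 0.25, 0.31, 0.28, 1.4, 0.09, 0.45, 0.2]

print(share_le(durations, 0.3))  # 5 of 8 samples -> 0.625
print(share_le(durations, 1.0))  # 7 of 8 samples -> 0.875
```

Because every raw sample is counted exactly once, the result does not depend on a step parameter, which is the point of the second advantage listed above.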