When to use gauge or histogram in prometheus in recording request duration?

Question

I'm new to metric monitoring.

If we want to record the duration of requests, I think we should use a gauge, but in practice some people use a histogram.

For example, grpc-ecosystem/go-grpc-prometheus uses a histogram to record durations. Are there agreed best practices for choosing metric types, or is this just their own preference?

```go
// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
	serverStartedCounter          *prom.CounterVec
	serverHandledCounter          *prom.CounterVec
	serverStreamMsgReceived       *prom.CounterVec
	serverStreamMsgSent           *prom.CounterVec
	serverHandledHistogramEnabled bool
	serverHandledHistogramOpts    prom.HistogramOpts
	serverHandledHistogram        *prom.HistogramVec
}
```

Thanks~

Answer 1

Score: 4

I'm new to this, but let me try to answer your question. Take my answer with a grain of salt; hopefully someone with more experience in using metrics to observe their systems will jump in.

As stated in https://prometheus.io/docs/concepts/metric_types/:

> A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

So if your goal were only to display the current value (the duration of the most recent request), you could use a gauge. But I think the goal of using metrics is to find problems within your system, to generate alerts when certain values fall outside a predefined range, or to derive a performance figure (like the Apdex score) for your system.

From https://prometheus.io/docs/concepts/metric_types/#histogram:

> Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.

From https://en.wikipedia.org/wiki/Apdex

>Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
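As a rough illustration of how an Apdex score can fall out of histogram data, the sketch below computes it from cumulative bucket counts, which is how Prometheus histograms store observations. The function name `apdex`, the thresholds (0.3 s as "satisfied", 1.2 s as "tolerated") and the request counts are assumptions made up for this example, not values from any real service.

```go
package main

import "fmt"

// apdex computes an Apdex score from cumulative histogram counts.
//
//	satisfied: requests at or below the target threshold T
//	tolerated: requests at or below the tolerable threshold (commonly 4T);
//	           buckets are cumulative, so this count includes the satisfied ones
//	total:     all observed requests
func apdex(satisfied, tolerated, total float64) float64 {
	if total == 0 {
		return 0 // no traffic: the score is undefined, this sketch reports 0
	}
	// Apdex = (satisfied + tolerating/2) / total, with
	// tolerating = tolerated - satisfied, which simplifies to the line below.
	return (satisfied + tolerated) / 2 / total
}

func main() {
	// Hypothetical counts: 60 requests <= 0.3s, 90 requests <= 1.2s, 100 in total.
	fmt.Printf("%.2f\n", apdex(60, 90, 100)) // prints 0.75
}
```

A gauge holding only the latest duration could not feed this calculation, because the counts of fast and slow requests are exactly what it throws away.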

For more on quantiles and how they are calculated in histograms and summaries, see https://prometheus.io/docs/practices/histograms/#quantiles

Two rules of thumb:

  1. If you need to aggregate, choose histograms.
  2. Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.

Or, as Adam Woodbeck puts it in his book "Network Programming with Go":

> The general advice is to use summaries when you don't know the range of expected values, but I'd advise you to use histograms whenever possible so that you can aggregate histograms on the metrics server.

Answer 2

Score: 4

The main difference between the gauge and histogram metric types in Prometheus is that, when Prometheus scrapes the target exposing the metric, it captures only a single (the last) value of a gauge, whereas a histogram captures every measured value by incrementing the corresponding histogram bucket.

For example, suppose request duration is measured for a frequently requested endpoint and Prometheus is set up to scrape your app every 30 seconds (e.g. scrape_interval: 30s in scrape_configs). If the duration is stored in a Gauge metric, Prometheus will scrape only the duration of the last request every 30 seconds. All the previous request duration measurements are lost.

On the other hand, any number of request duration measurements are registered in a Histogram metric, and this doesn't depend on the interval between scrapes of your app. Later, the Histogram metric makes it possible to obtain the distribution of request durations over an arbitrary time range.
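The difference can be sketched with two toy types, assuming four requests arrive between two scrapes. The `gauge` and `histogram` types below are simplified stand-ins written for this example, not the Prometheus client library, and the durations and bucket boundaries are invented.

```go
package main

import "fmt"

// gauge keeps only the most recently set value.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// histogram counts every observation in cumulative buckets.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds, in seconds
	counts      []uint64  // counts[i] = observations <= upperBounds[i]
	count       uint64    // all observations (the implicit +Inf bucket)
	sum         float64   // sum of all observed values
}

func newHistogram(upperBounds ...float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		counts:      make([]uint64, len(upperBounds)),
	}
}

// Observe records one value in every bucket whose upper bound covers it.
func (h *histogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.count++
	h.sum += v
}

func main() {
	g := &gauge{}
	h := newHistogram(0.1, 0.5, 1)

	// Hypothetical request durations (seconds) arriving between two scrapes.
	for _, d := range []float64{0.05, 0.2, 2.5, 0.07} {
		g.Set(d)
		h.Observe(d)
	}

	// What a scrape at this moment would see:
	fmt.Println(g.value)           // prints 0.07: the slow 2.5s request is invisible
	fmt.Println(h.counts, h.count) // prints [2 3 3] 4: one request exceeded 1s
}
```

The gauge reports a healthy-looking 0.07 s, while the histogram still shows that one of the four requests took longer than the largest finite bucket.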

Prometheus histograms have some issues though:

  • You need to choose the number and the boundaries of the histogram buckets so that they give good accuracy for the distribution of the measured metric. This isn't a trivial task, since you may not know the real distribution of the metric in advance.
  • If the number of buckets or their boundaries are changed for some measurement, then the histogram_quantile() function returns invalid results over that measurement.
  • Too many buckets per histogram may result in high cardinality issues, since each bucket in the histogram creates a separate time series.
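To get a feel for the cardinality point, the following back-of-the-envelope sketch counts the time series a classic (bucketed) Prometheus histogram exposes: one `_bucket` series per configured bucket, one for the implicit `+Inf` bucket, plus `_sum` and `_count`, for every label combination. The function name and the service dimensions are assumptions for illustration.

```go
package main

import "fmt"

// seriesPerHistogram estimates how many time series one classic histogram
// produces: configured buckets + the +Inf bucket + _sum + _count,
// multiplied by the number of distinct label combinations.
func seriesPerHistogram(configuredBuckets, labelCombinations int) int {
	return (configuredBuckets + 1 + 2) * labelCombinations
}

func main() {
	// Hypothetical service: 10 buckets, 20 methods x 5 status codes.
	fmt.Println(seriesPerHistogram(10, 20*5)) // prints 1300
}
```

Doubling the number of buckets for finer resolution nearly doubles the series count, which is why bucket layout is a trade-off between accuracy and storage cost.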

P.S. These issues are addressed in VictoriaMetrics histograms (I'm the core developer of VictoriaMetrics).

Answer 3

Score: 1

As valyala suggests, the main difference is that a histogram aggregates data, so you can use Prometheus's query functions to compute statistics over all recorded samples (counts, averages, estimated quantiles, etc.).

A gauge is better suited to measuring things like "wind velocity", "queue size", or any other kind of "instant data", where losing older samples matters little because you want to know the current picture.

Using gauges for "duration of the requests" would require a very short scrape period to be accurate, which is not practical even if your request rate is not very high: whenever more than one request arrives between two scrapes, all measurements but the last are lost. So, in summary, don't use gauges. A histogram fits your needs much better.

huangapple
  • Posted on April 6, 2022 at 22:21:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/71768510.html