Prometheus query by label with range vectors

Question

I'm defining a lot of counters in my app (using Java Micrometer), and in order to trigger alerts I tag the counters I want to monitor with "error":"alert", so a query like {error="alert"} returns multiple time series:

   error_counter_component1{error="alert", label2="random"}
   error_counter_component2{error="alert", label2="random2"}
   error_counter_component3{error="none", label2="random3"}

I don't control the names of the counters; I can only add the label to the counters I want to use in my alert. The alert I want should fire if all the counters labeled with error="alert" increase by more than 3 in one hour, so I could use a query like increase({error="alert"}[1h]) > 3, but I get the following error in Prometheus: Error executing query: vector cannot contain metrics with the same labelset

Is there a way to merge two range vectors, or should I include some kind of tag in the name of the counter? Or should I have a single counter for errors, with tags that specify the source, something like this:

errors_counter{source="component1", use_in_alerts="yes"}
errors_counter{source="component2", use_in_alerts="yes"}
errors_counter{source="component3", use_in_alerts="no"}

Answer 1

Score: 1

The version with the source="componentX" label fits the Prometheus data model much better. This assumes that error_counter is really one metric and that, other than the source label value, it will have the same labels, etc. (for example, it is emitted by the same library or framework).

Adding something like a use_in_alerts label is not a great solution. Such a label does not identify a time series.
I'd say keep the list of components to alert on wherever your alerting queries are constructed, and dynamically create separate alerting rules (without adding such a label to the raw data).
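
As a rough sketch of that idea, assuming the single errors_counter metric with a source label from the question, a generated rule file could contain one alerting rule per component on the list. The group name, alert names, and severity label below are made up for illustration, and the component list itself would live in whatever tool generates this file.

```yaml
# Hypothetical generated rule file: one alerting rule per component
# that should alert. component3 is simply not on the list.
groups:
  - name: component-errors          # hypothetical group name
    rules:
      - alert: Component1Errors     # hypothetical alert name
        expr: increase(errors_counter{source="component1"}[1h]) > 3
        labels:
          severity: warning         # hypothetical label
      - alert: Component2Errors
        expr: increase(errors_counter{source="component2"}[1h]) > 3
        labels:
          severity: warning
```

Note that increase() extrapolates over the window, so the value compared against 3 is not necessarily an integer.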
Another solution is to have a separate pseudo metric that is only used to provide metadata about components, like:

   component_alert_on{source="component2"} 1

and combine it in the alerting rule to alert only on the components you need. It can be generated in any way you like, but one possibility is to add it with a static recording rule. The downside is that this complicates the alerting query somewhat.
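
A minimal sketch of the pseudo-metric approach, assuming the errors_counter metric from the question and hypothetical group and alert names: a recording rule emits a constant component_alert_on series for each component that should alert, and the alerting rule keeps only the errors_counter series whose source label has a matching metadata series.

```yaml
groups:
  - name: alert-metadata            # hypothetical group name
    rules:
      # Static pseudo metric; add one such rule per component to alert on.
      - record: component_alert_on
        expr: vector(1)
        labels:
          source: component2
  - name: component-errors          # hypothetical group name
    rules:
      - alert: ComponentErrors      # hypothetical alert name
        # "and on (source)" keeps a left-hand series only if a right-hand
        # series with the same source label exists.
        expr: (increase(errors_counter[1h]) > 3) and on (source) (component_alert_on == 1)
```

Enabling alerts for another component then means adding another component_alert_on recording rule rather than touching the application's labels.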
But of course the use_in_alerts label will probably also work (at least while you are only alerting on this metric).
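
For comparison, a sketch of the label-based variant, assuming the errors_counter metric and use_in_alerts label proposed in the question (group and alert names are hypothetical). Because only one metric name is selected, the result keeps distinct label sets as long as source differs per series.

```yaml
groups:
  - name: component-errors          # hypothetical group name
    rules:
      - alert: ComponentErrors      # hypothetical alert name
        # Select only the series the application flagged for alerting.
        expr: increase(errors_counter{use_in_alerts="yes"}[1h]) > 3
```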

huangapple
  • Posted on October 5, 2020, 23:20:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64211524.html