Prometheus query by label with range vectors
Question
I'm defining a lot of counters in my app (using Java Micrometer), and in order to trigger alerts I tag the counters I want to monitor with "error":"alert", so a query like {error="alert"} will generate multiple range vectors:
error_counter_component1{error="alert", label2="random"}
error_counter_component2{error="alert", label2="random2"}
error_counter_component3{error="none", label2="random3"}
I don't control the names of the counters; I can only add the label to the counters I want to use in my alert. The alert I want is: if all the counters labeled with error="alert" increase by more than 3 in one hour. For that I could use a query like increase({error="alert"}[1h]) > 3, but I get the following error in Prometheus: Error executing query: vector cannot contain metrics with the same labelset
Is there a way to merge two range vectors, or should I include some kind of tag in the name of the counter? Or should I have a single counter for errors, with labels that specify the source, something like this:
errors_counter{source="component1", use_in_alert="yes"}
errors_counter{source="component2", use_in_alerts="yes"}
errors_counter{source="component3", use_in_alerts="no"}
Answer 1

Score: 1
The version with the source="componentX" label fits the Prometheus data model much better. This assumes the errors_counter metric really is one metric and that, apart from the source label value, it has the same labels etc. (for example, it is emitted by the same library or framework).
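With that layout the alert no longer has to mix different metric names, so a query along these lines should work (a sketch, using the errors_counter name from the question):

# one series per component under a single metric name, so increase()
# returns distinct label sets and the original error goes away
increase(errors_counter[1h]) > 3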
Adding something like a use_in_alerts label is not a great solution. Such a label does not identify a time series.
I'd say put the list of components to alert on wherever your alerting queries are constructed, and dynamically create separate alerting rules (without adding such a label to the raw data).
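The generated rules could then contain expressions roughly like the following (a sketch; the component names are the hypothetical ones from the question, and the list would come from wherever you keep your alerting configuration):

# one generated alert expression per component you care about
increase(errors_counter{source="component1"}[1h]) > 3
increase(errors_counter{source="component2"}[1h]) > 3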
Another solution is to have a separate pseudo-metric that is only used to provide metadata about the components, like:

component_alert_on{source="component2"} 1

and combine it in the alerting rule to only alert on the components you need. It can be generated in any possible way, but one possibility is to add it in a static recording rule. The downside is that it complicates the alerting query somewhat.
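The combined alerting expression could then look roughly like this (a sketch, assuming the errors_counter{source=...} layout discussed above and the component_alert_on metric from the example):

# keep only the components that have an opt-in component_alert_on series
(increase(errors_counter[1h]) > 3)
  and on(source)
component_alert_on == 1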
But of course the use_in_alerts label will probably also work (at least as long as you are only alerting on this one metric).
Comments