How do I instrument region and environment information correctly in Prometheus?

Question

I have an application, and I run one instance of it per AWS region.
I'm trying to instrument the application code with the Prometheus metrics client and will expose the collected metrics on the /metrics endpoint. A central server will scrape the /metrics endpoints across all regions and store the data in a central time series database.

Let's say I've defined a metric named http_responses_total. I would like to know its value aggregated over all regions as well as the individual regional values.
How do I store the region information (any one of 13 regions) and the env information (dev, test, or prod) along with the metrics so that I can slice and dice them by region and env?

I found a few ways to do it, but I'm not sure how it's done in general, as it seems a pretty common scenario:

  • Store region and env as labels on each metric (not recommended: https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels)
  • Use target labels - I have the region and env values in the application and would like to set them from the application itself instead of in the scrape config
  • Keep a separate gauge metric that records region and env as labels (as described here: https://www.robustperception.io/exposing-the-software-version-to-prometheus) - this is how I plan to store the application's version info in the time series database. The difference between version and region, though, is that the version changes between releases, whereas the region is a constant read from a configuration file. So I'm not sure this is a good approach.

I'm new to Prometheus. Could someone please suggest how I should store this region and env information? Are there any better ways?

Answer 1

Score: 2

All the proposed options will work, and all of them have downsides.

The first option (having the application expose env and region with every metric) is easy to implement but hard to maintain. Eventually somebody will forget to add these labels, opening the door to a failure that goes unobserved. Aside from that, you may not be able to add these labels to other exporters written by someone else. Lastly, if you have to deal with millions of time series, more plain-text data means more traffic.
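To put a rough number on the "more plain text" point, here is a small sketch that renders one sample line in the text exposition format with and without the two extra labels. The `code` label, the label values, and the sample value are made up for illustration; the real cost depends on how many series you expose and on whether scrape responses are compressed.

```python
# Rough size comparison for one exposition-format sample line, with and
# without region/env labels emitted by the application (option 1).
# Label names other than region/env, and all values, are invented.

def sample_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}\n"
    return f"{name} {value}\n"

plain = sample_line("http_responses_total", {"code": "200"}, 1027)
tagged = sample_line(
    "http_responses_total",
    {"code": "200", "region": "us-east-1", "env": "prod"},
    1027,
)

# Difference in payload size for this single sample line.
extra_bytes = len(tagged.encode()) - len(plain.encode())
print(plain, tagged, sep="")
print(f"extra bytes per sample: {extra_bytes}")
```

Multiply the per-sample difference by the number of series per scrape to get a feel for the overhead in your own setup.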

The second* option (having Prometheus add these labels as part of its scrape configuration) is what I would choose. To keep it short, consider this monitoring setup:

Datacenter Prometheus
  1. Collects metrics from local instances.
  2. Adds dc label to each metric.
  3. Pushes the data into the regional Prometheus.
    ->
Regional Prometheus
  1. Collects data on datacenter scale.
  2. Adds region label to all metrics.
  3. Pushes the data into the global instance.
    ->
Global Prometheus
  Simply collects and stores the data on global scale.

*See the note below.
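For the single central server described in the question, one way to attach such labels on the Prometheus side is file-based service discovery (`file_sd_configs`), where each target group in a JSON file carries its own labels. The sketch below only generates that JSON; the hostnames, port, and inventory list are placeholders I made up, and how you produce the inventory (config file, deployment tooling) is up to you.

```python
import json

# Hypothetical inventory: one application instance per region and environment.
# Hostnames, port, and the region/env values are placeholders.
INSTANCES = [
    {"host": "app.us-east-1.example.com:8080", "region": "us-east-1", "env": "prod"},
    {"host": "app.eu-west-1.example.com:8080", "region": "eu-west-1", "env": "prod"},
    {"host": "app.us-east-1.dev.example.com:8080", "region": "us-east-1", "env": "dev"},
]

def build_target_groups(instances):
    """Return file_sd target groups that attach region/env as target labels."""
    return [
        {
            "targets": [inst["host"]],
            "labels": {"region": inst["region"], "env": inst["env"]},
        }
        for inst in instances
    ]

target_groups = build_target_groups(INSTANCES)
# Write this output to a file that a file_sd_configs entry points at.
print(json.dumps(target_groups, indent=2))
```

Every series scraped from a target then gets that target's region and env labels, so the application itself does not have to expose them.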

The third option (storing these labels in a separate metric) will make it quite difficult to write and understand queries. Take this one for example:

sum by(instance) (node_arp_entries) and on(instance) node_exporter_build_info{version="0.17.0"}

It calculates the sum of node_arp_entries for instances running node-exporter version="0.17.0". More precisely, it calculates a sum for every instance and then drops the instances with a different version, but you get the idea.
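If the two-step behaviour is hard to picture, here is a toy model of that query in plain Python. It is only meant to mirror the semantics described above, not how Prometheus evaluates anything, and all the sample data (instances, devices, values) is invented.

```python
# Toy model of:
#   sum by(instance) (node_arp_entries)
#     and on(instance) node_exporter_build_info{version="0.17.0"}
# All sample data below is invented for illustration.

from collections import defaultdict

node_arp_entries = [
    ({"instance": "a:9100", "device": "eth0"}, 12),
    ({"instance": "a:9100", "device": "eth1"}, 3),
    ({"instance": "b:9100", "device": "eth0"}, 7),
]
node_exporter_build_info = [
    ({"instance": "a:9100", "version": "0.17.0"}, 1),
    ({"instance": "b:9100", "version": "0.16.0"}, 1),
]

# Step 1: sum by(instance): aggregate away every label except instance.
sums = defaultdict(int)
for labels, value in node_arp_entries:
    sums[labels["instance"]] += value

# Step 2: select the build_info series with the wanted version.
wanted = {
    labels["instance"]
    for labels, _ in node_exporter_build_info
    if labels["version"] == "0.17.0"
}

# Step 3: "and on(instance)" keeps left-hand results that have a match.
result = {inst: total for inst, total in sums.items() if inst in wanted}
print(result)
```

Instance b is summed in step 1 and only thrown away in step 3, which is the "calculate, then drop" order described above.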


*Note: option 2 as laid out above is the kind of setup you need at Google scale, but the point is its simplicity. It's perfectly clear where each label comes from and why. This approach makes the Prometheus configuration somewhat more complicated, and the fewer Prometheus instances you have, the more scrape configurations you will need. Overall, I think this option beats the alternatives.

huangapple
  • Posted on March 9, 2022, 18:55:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/71408188.html