How do I instrument region and environment information correctly in Prometheus?


I have an application, and I'm running one instance of it per AWS region.
I'm trying to instrument the application code with a Prometheus metrics client and will expose the collected metrics at the /metrics endpoint. A central server will scrape the /metrics endpoints across all the regions and store the results in a central time series database.

Let's say I've defined a metric named http_responses_total. I would like to know its value aggregated over all the regions, along with the individual regional values.
How do I store this region information (which could be any one of 13 regions) and env information (which could be dev, test, or prod) along with the metrics, so that I can slice and dice them based on region and env?
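For concreteness, assuming region and env do end up attached as labels by whichever mechanism, the slicing described above would look like this in PromQL (label names are illustrative):

```promql
sum(rate(http_responses_total[5m]))                           # aggregated over all regions
sum by(region) (rate(http_responses_total[5m]))               # per-region values
sum by(region) (rate(http_responses_total{env="prod"}[5m]))   # prod only
```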

I found a few ways to do it, but not sure how it's done in general, as it seems a pretty common scenario:

  • Store the region and env information as labels on every metric (not recommended: https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels)
  • Use target labels. I have the region and env values available in the application and would like to set them from the application itself, rather than in the scrape configuration.
  • Keep a separate gauge metric that records the region and env information as labels (as described here: https://www.robustperception.io/exposing-the-software-version-to-prometheus). This is how I plan to store the application's version information in the time series database, but the difference between version and region is that the version changes between releases while the region is a constant read from a configuration file, so I'm not sure this is a good approach.

I'm new to Prometheus. Could someone please suggest how I should store this region and env information? Are there any other better ways?

Answer 1

Score: 2


All the proposed options will work, and all of them have downsides.

The first option (having env and region exposed by the application with every metric) is easy to implement but hard to maintain. Eventually somebody will forget about these labels, opening the door to an unobserved failure. Aside from that, you may not be able to add these labels to other exporters written by someone else. Lastly, if you have to deal with millions of time series, more plain text data means more traffic.
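For reference, the first option amounts to baking the static labels into every exposed sample. A minimal stdlib-only sketch of the text exposition format (no client library; the metric name and label values are illustrative):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderSample renders one metric line in the Prometheus text exposition
// format, with the static region/env labels merged in alongside the
// metric's own labels.
func renderSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	// Static labels the application would have to attach to every sample.
	static := map[string]string{"region": "eu-west-1", "env": "prod"}
	labels := map[string]string{"code": "200"}
	for k, v := range static {
		labels[k] = v
	}
	fmt.Println(renderSample("http_responses_total", labels, 1027))
	// http_responses_total{code="200",env="prod",region="eu-west-1"} 1027
}
```

The maintenance burden the answer describes is visible here: every exposition path in every application has to remember to merge `static` in.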

The second* option (adding these labels in the Prometheus scrape configuration) is what I would choose. To keep it brief, consider this monitoring setup:

| Datacenter Prometheus | Regional Prometheus | Global Prometheus |
| --- | --- | --- |
| 1. Collects metrics from local instances. 2. Adds a dc label to each metric. 3. Pushes the data to the regional Prometheus | 1. Collects data at datacenter scale. 2. Adds a region label to all metrics. 3. Pushes the data to the global instance | Simply collects and stores the data at global scale |
> *see note below
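With this option the labels come from the scrape configuration rather than the application. A minimal sketch of what that could look like, using Prometheus's static_configs target labels (job name and target address are placeholders):

```yaml
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['app-1.eu-west-1.internal:8080']
        labels:
          region: 'eu-west-1'
          env: 'prod'
```

Every sample scraped from that target gets the region and env labels attached by Prometheus itself, so the application code never has to know about them.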

The third option (storing these labels in a separate metric) will make it quite difficult to write and understand queries. Take this one for example:

sum by(instance) (node_arp_entries) and on(instance) node_exporter_build_info{version="0.17.0"}

It calculates a sum of node_arp_entries for instances with node-exporter version="0.17.0". More specifically, it calculates a sum for every instance and then drops those with the wrong version, but you get the idea.
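Applied to the question, the same join pattern with a hypothetical app_info metric (value 1, carrying region and env labels) would look like:

```promql
sum by(region) (
    rate(http_responses_total[5m])
  * on(instance) group_left(region, env)
    app_info
)
```

The group_left copies the region and env labels from app_info onto every matching series, which works, but is noticeably harder to read and write than a plain label filter.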


OPTION 2 is the kind of setup you need at Google scale, but the point here is simplicity. It's perfectly clear where each label comes from and why. This approach requires a somewhat more complicated Prometheus configuration, and the fewer Prometheus instances you have, the more scrape configuration each one will need. Overall, I think this option beats the alternatives.

huangapple
  • Published on 2022-03-09 18:55:30
  • Please retain this link when reposting: https://go.coder-hub.com/71408188.html