March 28, 2023, 16:50:03 · Go

Prometheus metrics are overwritten

Question

I have a deployment on OpenShift: a host status monitor app.

I also have a deployment for a Prometheus instance and operator.

The app is written in Go. It performs some operations and publishes metrics to Prometheus.

I have created a service and a route in OpenShift, and through this route I can see all my Prometheus metrics. I am using the route's link as the Prometheus data source for Grafana.

The problem is that if I increase the pod count of the host status monitor app deployment to, say, 2, the second pod overwrites all the Prometheus metrics published by the first pod.

I want to run multiple pods in my deployment and have all of their Prometheus metrics available through the same route link, without one pod affecting the metrics of another. How can I achieve this?

I am using the function below to publish the metrics:

func PublishMetrics(hostName, originalState, swarch, server, provtype string) {
	states := []string{STATE_ACTIVE, STATE_UNRESPONSIVE, STATE_INACTIVE, STATE_DECOMISSIONED, STATE_REMOVED}

	for _, state := range states {
		HostStateGauge.With(prometheus.Labels{
			"host":     hostName,
			"state":    state,
			"swarch":   swarch,
			"server":   server,
			"provtype": provtype,
		}).Set(0)
	}
	HostStateGauge.With(prometheus.Labels{
		"host":     hostName,
		"state":    STATE_UNKNOWN,
		"swarch":   "",
		"server":   "",
		"provtype": "",
	}).Set(0)
	HostStateGauge.With(prometheus.Labels{
		"host":     hostName,
		"state":    originalState,
		"swarch":   swarch,
		"server":   server,
		"provtype": provtype,
	}).Set(1)
}

PS: I am new to OpenShift and Prometheus.

Thanks.

Answer 1

Score: 1

The issue arises because the PublishMetrics() function uses the same set of Prometheus label values in every pod, so the series from different pods are indistinguishable and Prometheus treats them as one and the same time series.

To solve this issue and ensure that metrics from each pod are separate and distinct, you can modify the labels used in PublishMetrics() to include a unique identifier for each pod.
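To illustrate why identical label sets collide, here is a toy model, not the Prometheus implementation. It assumes only that a series is identified by its metric name plus its full label set. The metric name host_state, the store type, and the pod names are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a stable identity from a metric name and its labels,
// mimicking how a time series is identified by name plus label pairs.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so that map iteration order does not matter
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// store is a toy stand-in for a metrics backend: one value per series key.
type store map[string]float64

func (s store) set(name string, labels map[string]string, v float64) {
	s[seriesKey(name, labels)] = v
}

func main() {
	// Without a pod label, both pods write to the same series.
	without := store{}
	without.set("host_state", map[string]string{"host": "h1", "state": "active"}, 1) // pod A
	without.set("host_state", map[string]string{"host": "h1", "state": "active"}, 0) // pod B
	fmt.Println(len(without)) // prints 1: pod B replaced pod A's value

	// With a pod label, each pod gets its own series.
	with := store{}
	with.set("host_state", map[string]string{"pod": "pod-a", "host": "h1", "state": "active"}, 1)
	with.set("host_state", map[string]string{"pod": "pod-b", "host": "h1", "state": "active"}, 0)
	fmt.Println(len(with)) // prints 2
}
```

The only point of the sketch is that a label whose value differs per pod changes the series identity, which is what the fix described here relies on.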

First read the pod name and then add the pod label to all your metrics.

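// Get the name of the current pod (in Kubernetes, HOSTNAME is typically the pod name)
podName := os.Getenv("HOSTNAME")

// In PublishMetrics, add the pod label to every label set.
// "pod" must also be declared in the gauge vector's label names.
HostStateGauge.With(prometheus.Labels{
	"pod":      podName, // Add pod name as a label
	"host":     hostName,
	"state":    state,
	"swarch":   swarch,
	"server":   server,
	"provtype": provtype,
}).Set(0)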

By including the pod name as a label, Prometheus will treat the metrics from each pod as separate time series, even when all their other label values are identical.

Answer 2

Score: 0

Here is some of the exposed metrics output:

# TYPE hoststatusmonitorapp_request_duration_seconds histogram
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.005"} 2776
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.01"} 3146
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.025"} 3220
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.05"} 3250
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.1"} 3327

It comes from a different function:

func InitializePrometheusMetrics(appName string) {
	RequestHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    fmt.Sprintf("%s_request_duration_seconds", appName),
			Help:    fmt.Sprintf("Time (in seconds) spent serving %s request", appName),
			Buckets: prometheus.DefBuckets,
		},
		[]string{"pod_name", "name", "result"}) // request name and request result; result != "" => error message
}

// Prometheus Middleware for MQTT Message Received Handler
func MqttMessageHandlerPrometheusMiddleware(
	msgHandler func(topic string, payload []byte, mongoClient *mongo.Client, ctx context.Context) (string, string, error),
	topic string,
	payload []byte,
	mongoClient *mongo.Client,
	ctx context.Context,
) {
	start := time.Now()
	reqHandlerName, reqHandlerResult, err := msgHandler(topic, payload, mongoClient, ctx)
	if err != nil {
		LogError(fmt.Errorf("In processing topic: %s, epochtime: %s, err: %v", topic, payload, err))
	}
	RequestHistogram.With(prometheus.Labels{
		"pod_name": GetNormalizedHostname(APP_NAME),
		"name":     reqHandlerName,
		"result":   reqHandlerResult,
	}).Observe(time.Since(start).Seconds())
}

For the function above, I found a workaround.
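The GetNormalizedHostname helper is not shown in this answer. As a guess that is merely consistent with the pod_name values in the output above, it may prefix the app name and replace hyphens in the hostname with underscores. The function normalizedHostname below and the sample hostname are hypothetical stand-ins, not the author's code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizedHostname is a hypothetical stand-in for GetNormalizedHostname:
// it prefixes the app name and replaces "-" with "_" in the hostname.
// The real helper might read the hostname itself and behave differently.
func normalizedHostname(appName, hostname string) string {
	return appName + "_" + strings.ReplaceAll(hostname, "-", "_")
}

func main() {
	// Assumed pod hostname, reconstructed from the sample metrics output.
	fmt.Println(normalizedHostname("hoststatusmonitorapp", "host-status-monitor-app-5888b67d67-x74vm"))
	// prints hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm
}
```

Whatever the exact implementation, the effect is the same as in Answer 1: each pod contributes a distinct pod_name label value, so its histogram series stay separate from those of the other pods.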
