
Prometheus metrics are overwritten

Question


I have a deployment on OpenShift - a host status monitor app.

I also have a deployment for a Prometheus instance and operator.

The monitor is a Go application that performs some operations and publishes metrics for Prometheus to scrape.

I have created a service and a route in OpenShift, and through that route I can see all of my Prometheus metrics. I am using the route's URL as the Prometheus data source for Grafana.

Now the problem: if I increase the pod count of the host status monitor deployment to, say, 2, the 2nd pod overwrites all the Prometheus metrics published by the 1st pod.

I want to run multiple pods in my deployment, with all of their Prometheus metrics available at the same route URL, without one pod affecting the metrics of another. How can I achieve this?

I am using the following function to publish the metrics:

func PublishMetrics(hostName, originalState, swarch, server, provtype string) {
	states := []string{STATE_ACTIVE, STATE_UNRESPONSIVE, STATE_INACTIVE, STATE_DECOMISSIONED, STATE_REMOVED}

	// Reset the gauge to 0 for every known state of this host.
	for _, state := range states {
		HostStateGauge.With(prometheus.Labels{
			"host":     hostName,
			"state":    state,
			"swarch":   swarch,
			"server":   server,
			"provtype": provtype,
		}).Set(0)
	}
	// Also reset the "unknown" series, which carries empty qualifier labels.
	HostStateGauge.With(prometheus.Labels{
		"host":     hostName,
		"state":    STATE_UNKNOWN,
		"swarch":   "",
		"server":   "",
		"provtype": "",
	}).Set(0)
	// Finally mark the host's current state with 1.
	HostStateGauge.With(prometheus.Labels{
		"host":     hostName,
		"state":    originalState,
		"swarch":   swarch,
		"server":   server,
		"provtype": provtype,
	}).Set(1)
}
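
For context, HostStateGauge itself is not shown above. Given how it is used, it has to be a prometheus.GaugeVec declared with those five labels, registered, and exposed over HTTP. A minimal sketch of that plumbing (the metric name, help text, and port are placeholders, not taken from the real app):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One time series is created per unique combination of these five labels.
var HostStateGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "host_state", // placeholder name
		Help: "Current state of each monitored host (1 = host is in this state).",
	},
	[]string{"host", "state", "swarch", "server", "provtype"},
)

func main() {
	prometheus.MustRegister(HostStateGauge)
	// Expose the registered metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}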

PS: I am new to OpenShift and Prometheus.

Thanks.

Answer 1

Score: 1


The issue you are facing is that the PublishMetrics() function uses the same set of Prometheus labels in every pod, so the metrics from each pod are being aggregated together by Prometheus.

To solve this and keep the metrics from each pod separate and distinct, you can modify the labels used in PublishMetrics() to include a unique identifier for each pod.

First read the current pod's name, then add it as a pod label on all of your metrics:

// Get the name of the current pod (Kubernetes sets HOSTNAME to the pod name)
podName := os.Getenv("HOSTNAME")

HostStateGauge.With(prometheus.Labels{
	"pod":      podName, // add the pod name as a label
	"host":     hostName,
	"state":    state,
	"swarch":   swarch,
	"server":   server,
	"provtype": provtype,
}).Set(0)

By including the pod name as a label, Prometheus will treat the metrics from each pod as separate time series, even if all the other label values are identical.
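
Putting it together, the whole function might look like the sketch below (untested; it assumes HostStateGauge is re-declared with "pod" as an additional label in its NewGaugeVec call - With() panics if it is given a label that was not declared on the vector):

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

func PublishMetrics(hostName, originalState, swarch, server, provtype string) {
	// Kubernetes/OpenShift sets HOSTNAME to the pod name inside the container.
	podName := os.Getenv("HOSTNAME")

	states := []string{STATE_ACTIVE, STATE_UNRESPONSIVE, STATE_INACTIVE, STATE_DECOMISSIONED, STATE_REMOVED}
	for _, state := range states {
		HostStateGauge.With(prometheus.Labels{
			"pod":      podName, // unique per pod => separate time series
			"host":     hostName,
			"state":    state,
			"swarch":   swarch,
			"server":   server,
			"provtype": provtype,
		}).Set(0)
	}
	// (The STATE_UNKNOWN reset from the original gets the same "pod" label.)
	HostStateGauge.With(prometheus.Labels{
		"pod":      podName,
		"host":     hostName,
		"state":    originalState,
		"swarch":   swarch,
		"server":   server,
		"provtype": provtype,
	}).Set(1)
}

In Grafana/PromQL you can still aggregate across pods afterwards, e.g. sum without(pod) (your_metric_name).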

Answer 2

Score: 0


Here is some sample output from the metrics endpoint:

# TYPE hoststatusmonitorapp_request_duration_seconds histogram
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.005"} 2776
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.01"} 3146
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.025"} 3220
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.05"} 3250
hoststatusmonitorapp_request_duration_seconds_bucket{name="MQTTMsgHandler",pod_name="hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm",result="",le="0.1"} 3327

This output comes from a different function:

func InitializePrometheusMetrics(appName string) {
	// NewHistogramVec only creates the vector; it still has to be registered
	// (e.g. prometheus.MustRegister(RequestHistogram)), presumably done elsewhere.
	RequestHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    fmt.Sprintf("%s_request_duration_seconds", appName),
			Help:    fmt.Sprintf("Time (in seconds) spent serving %s requests", appName),
			Buckets: prometheus.DefBuckets,
		},
		[]string{"pod_name", "name", "result"}) // request name; result != "" => error message
}

// Prometheus middleware for the MQTT message-received handler
func MqttMessageHandlerPrometheusMiddleware(
	msgHandler func(topic string, payload []byte, mongoClient *mongo.Client, ctx context.Context) (string, string, error),
	topic string,
	payload []byte,
	mongoClient *mongo.Client,
	ctx context.Context,
) {
	start := time.Now()
	reqHandlerName, reqHandlerResult, err := msgHandler(topic, payload, mongoClient, ctx)
	if err != nil {
		LogError(fmt.Errorf("In processing topic: %s, epochtime: %s, err: %v", topic, payload, err))
	}
	RequestHistogram.With(prometheus.Labels{
		"pod_name": GetNormalizedHostname(APP_NAME),
		"name":     reqHandlerName,
		"result":   reqHandlerResult,
	}).Observe(time.Since(start).Seconds())
}

For the function above, I found a workaround.
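
GetNormalizedHostname is not shown here. Judging by the pod_name values in the scrape output above, it presumably does something like this hypothetical reconstruction:

import (
	"os"
	"strings"
)

// Hypothetical reconstruction of GetNormalizedHostname: prefix the pod's
// hostname (which is the pod name) with the app name and replace "-" with "_",
// matching values like "hoststatusmonitorapp_host_status_monitor_app_5888b67d67_x74vm".
func GetNormalizedHostname(appName string) string {
	hostname, err := os.Hostname() // inside a pod this returns the pod name
	if err != nil {
		hostname = os.Getenv("HOSTNAME") // fallback: same value, set by Kubernetes
	}
	return appName + "_" + strings.ReplaceAll(hostname, "-", "_")
}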
