A Kubernetes worker node is being "ignored" for metrics scraping

Question

I am using Prometheus and Grafana to collect and display metrics for a Kubernetes cluster. While collecting memory information, I discovered that one of the worker nodes is missing from the results for certain metrics but present for others. The only thing I can see that might be related is that the node has a taint applied.

Here is the node taint:

nodeType=runner-node:NoExecute

The rest of the worker nodes have no (obvious) taint. Could this be the reason why nothing is being scraped?

Here is an example of a metric that has data for this node (arc-worker-4):

Query:

machine_memory_bytes{node="arc-worker-4"}

Result:

metric value
machine_memory_bytes{boot_id="3b6af3e8-d3ae-457a-92be-f7da2adededf", endpoint="https-metrics", instance="172.20.32.14:10250", job="kubelet", machine_id="6c59590e61484bfca6f8da38897d7760", metrics_path="/metrics/cadvisor", namespace="kube-system", node="arc-worker-4", service="prometheus-kube-prometheus-kubelet", system_uuid="c7874d56-2d9d-ce1a-986f-1f549f1784b6"} 135090417664

If I run a query on another metric, I get no result:

Query:

node_memory_MemTotal_bytes{node="arc-worker-4"}

Result:

Empty query result

None of the metrics named node_memory_..._bytes (there are about 50) have any data for this node. Why? I get data for all other nodes, including the master node.

Answer 1

Score: 0

I was able to resolve this by adding a toleration to the Prometheus (kube-prometheus-stack) configuration. This allows the node-exporter that came with Prometheus to be deployed onto the node with that taint. I am now getting results from the node_memory_..._bytes family of metrics.

What was done:

In the Prometheus Helm chart values.yaml, the following was added:

  prometheus-node-exporter:
    tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nodeType
        operator: Equal
        value: runner-node
        effect: NoExecute

The first toleration is the default, but it needs to be specified here; otherwise it is overridden by the new list. I needed it so that the master node would still be scraped.
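To illustrate why both entries are needed, here is a rough Python sketch of how Kubernetes matches tolerations against taints. It is not the scheduler's actual code, just a simplified model of the documented matching rules. The class and function names (Taint, Toleration, tolerates, pod_can_run) are made up for this example, and the master node taint shown is an assumption about a typical cluster, not something taken from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key + Exists matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified model of the documented toleration matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def pod_can_run(tolerations, taints) -> bool:
    """True if every hard taint (NoSchedule/NoExecute) is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Taint from the question.
runner_taint = Taint("nodeType", "runner-node", "NoExecute")
# ASSUMPTION: a typical control-plane taint; check your own nodes.
master_taint = Taint("node-role.kubernetes.io/control-plane", "", "NoSchedule")

# The chart default described in the answer.
default_tolerations = [Toleration(operator="Exists", effect="NoSchedule")]
# The default plus the toleration added in values.yaml.
patched_tolerations = default_tolerations + [
    Toleration(key="nodeType", operator="Equal",
               value="runner-node", effect="NoExecute"),
]

print(pod_can_run(default_tolerations, [runner_taint]))      # False
print(pod_can_run(patched_tolerations, [runner_taint]))      # True
print(pod_can_run(patched_tolerations[1:], [master_taint]))  # False
```

The first result shows why node-exporter never landed on arc-worker-4: the default toleration only covers NoSchedule taints, not NoExecute. The last one shows what would happen if only the new toleration were listed: under the assumed control-plane taint, the master node would lose its node-exporter pod.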

huangapple
  • Published on July 10, 2023, 20:57:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76653981.html