A Kubernetes worker node is being "ignored" for metrics scraping
Question
I am using Prometheus and Grafana to collect and display metrics for a Kubernetes cluster. While collecting memory information, I discovered that one of the worker nodes is missing from the results for certain metrics, although it does appear for others. The only thing I can see that might be related is that this node has a taint applied.
Here is the node taint:
nodeType=runner-node:NoExecute
The rest of the worker nodes have no (obvious) taint. Could this be the reason why nothing is being scraped?
Here is an example of a metric that has information for this node (arc-worker-4):
Query:
machine_memory_bytes{node="arc-worker-4"}
Result:
metric | value |
---|---|
machine_memory_bytes{boot_id="3b6af3e8-d3ae-457a-92be-f7da2adededf", endpoint="https-metrics", instance="172.20.32.14:10250", job="kubelet", machine_id="6c59590e61484bfca6f8da38897d7760", metrics_path="/metrics/cadvisor", namespace="kube-system", node="arc-worker-4", service="prometheus-kube-prometheus-kubelet", system_uuid="c7874d56-2d9d-ce1a-986f-1f549f1784b6"} | 135090417664 |
If I run a query on another metric, I get no result:
Query:
node_memory_MemTotal_bytes{node="arc-worker-4"}
Result:
Empty query result
In the group of metrics named node_memory_..._bytes (of which there are about 50), none have any data for this node. Why? I get data for all other nodes, including the master node.
Answer 1
Score: 0
I was able to resolve this problem by adding a toleration to the Prometheus (kube-prometheus-stack) config. This allows the node-exporter that came with Prometheus to be deployed onto the node with that taint. I am now getting results from the node_memory_..._bytes family of metrics.
What was done:
In the Prometheus Helm chart values.yaml, the following was added:
prometheus-node-exporter:
  tolerations:
    - effect: NoSchedule
      operator: Exists
    - key: nodeType
      operator: Equal
      value: runner-node
      effect: NoExecute
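The first toleration is the chart default, but it needs to be specified here as well; otherwise it is overwritten, because the list given in values.yaml replaces the default list rather than being merged with it. I needed it so that the master node would still be scraped.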
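To see why both tolerations are needed, here is a minimal Python sketch of the taint/toleration matching rules as I understand them from the Kubernetes documentation. It is an illustration only, not the scheduler's actual code: the names `Taint`, `Toleration` and `tolerates` are hypothetical, and the control-plane taint key used for the master node is an assumption, since the question does not show it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key + Exists matches any taint key
    operator: str = "Equal"  # Kubernetes defaults the operator to Equal
    value: str = ""
    effect: str = ""         # empty effect matches any taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    # A non-empty effect on the toleration must equal the taint's effect.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # A non-empty key on the toleration must equal the taint's key.
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False


# Taint from the question: nodeType=runner-node:NoExecute
RUNNER_TAINT = Taint(key="nodeType", value="runner-node", effect="NoExecute")
# Assumed taint on the master node (not shown in the question).
MASTER_TAINT = Taint(key="node-role.kubernetes.io/control-plane", effect="NoSchedule")

# The two tolerations from the values.yaml above.
DEFAULT_TOLERATION = Toleration(operator="Exists", effect="NoSchedule")
RUNNER_TOLERATION = Toleration(
    key="nodeType", operator="Equal", value="runner-node", effect="NoExecute"
)

if __name__ == "__main__":
    for name, taint in (("runner", RUNNER_TAINT), ("master", MASTER_TAINT)):
        for label, tol in (("default", DEFAULT_TOLERATION), ("added", RUNNER_TOLERATION)):
            print(f"{label} toleration vs {name} taint: {tolerates(tol, taint)}")
```

Under these rules the default toleration only covers NoSchedule taints, so it never matched the NoExecute taint on arc-worker-4 and node-exporter was never placed there. The added toleration matches that taint but not the master's, which is why both entries have to be listed.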