July 10, 2023, 20:57:30

A Kubernetes worker node is being "ignored" for metrics scraping

Question

I am using Prometheus and Grafana to collect and display metrics for a Kubernetes cluster. In this case I am collecting memory information, and I have discovered that one of the worker nodes does not appear in the results for certain metrics, while it does appear for others. The only thing I can see that might be related is that this node has a taint applied.

Here is the node taint:

nodeType=runner-node:NoExecute

The rest of the worker nodes have no (obvious) taint. Could this be the reason why nothing is being scraped?

Here is an example of a metric that has information for this node (arc-worker-4):

Query:

machine_memory_bytes{node="arc-worker-4"}

Result:

metric	value
machine_memory_bytes{boot_id="3b6af3e8-d3ae-457a-92be-f7da2adededf", endpoint="https-metrics", instance="172.20.32.14:10250", job="kubelet", machine_id="6c59590e61484bfca6f8da38897d7760", metrics_path="/metrics/cadvisor", namespace="kube-system", node="arc-worker-4", service="prometheus-kube-prometheus-kubelet", system_uuid="c7874d56-2d9d-ce1a-986f-1f549f1784b6"}	135090417664

If I run a query on another metric, I get no result:

Query:

node_memory_MemTotal_bytes{node="arc-worker-4"}

Result:

Empty query result

In the group of metrics named node_memory_..._bytes (of which there are about 50), none of these have any data for this node. Why? I get data for all other nodes, including the master node.

Answer 1

Score: 0

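I was able to resolve this problem by adding a toleration to the Prometheus (kube-prometheus-stack) config. This allows the node-exporter that came with Prometheus to be deployed onto the node with that taint. The node_memory_..._bytes metrics come from node-exporter, whereas machine_memory_bytes is served by the kubelet's cAdvisor endpoint (note job="kubelet" and metrics_path="/metrics/cadvisor" in the result above), which is why only the node_memory family was missing. I am now getting results from the node_memory_..._bytes family of metrics.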

What was done:

In the Prometheus Helm chart values.yaml, the following was added:

  prometheus-node-exporter:
    tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nodeType
        operator: Equal
        value: runner-node
        effect: NoExecute

The first toleration is the chart default, but it needs to be restated here because setting tolerations in values.yaml replaces the default list rather than adding to it. I needed it so that the master node would still be scraped.
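To illustrate why both tolerations are needed, here is a minimal Python sketch of the taint/toleration matching rules as I understand them from the Kubernetes documentation. It is not the scheduler's actual code: the Taint and Toleration classes and the can_run_on helper are hypothetical names made up for this example, and the master node taint (node-role.kubernetes.io/control-plane:NoSchedule) is an assumption, since the question does not show it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key is only valid with operator "Exists"
    operator: str = "Equal"  # Kubernetes defaults to "Equal" when omitted
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if not toleration.key:
        # An empty key matches all keys, but only with operator "Exists".
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False


def can_run_on(tolerations, taints) -> bool:
    """A pod can run on a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Taint from the question.
runner_taint = Taint(key="nodeType", value="runner-node", effect="NoExecute")
# ASSUMPTION: a typical control-plane taint; not shown in the question.
master_taint = Taint(key="node-role.kubernetes.io/control-plane", effect="NoSchedule")

chart_default = [Toleration(operator="Exists", effect="NoSchedule")]
only_custom = [Toleration(key="nodeType", operator="Equal",
                          value="runner-node", effect="NoExecute")]
with_fix = chart_default + only_custom

print(can_run_on(chart_default, [runner_taint]))  # False: default covers NoSchedule only
print(can_run_on(only_custom, [master_taint]))    # False: default toleration was dropped
print(can_run_on(with_fix, [runner_taint]))       # True
print(can_run_on(with_fix, [master_taint]))       # True
```

Under these assumptions, the default toleration alone never matches a NoExecute taint, and the custom toleration alone no longer matches the master's NoSchedule taint, so the values.yaml needs to list both.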
