May 8, 2023, 00:37:08

Trouble with Prometheus metrics (Adapter and metricsQuery)

Question

Original problem: I would like a Kubernetes cluster that always has at least 2 nodes with zero GPU consumption. If a job arrives and takes one node, the autoscaler should create another spare node.

I found out that I can rely on the DCGM_FI_DEV_GPU_UTIL metric. If DCGM_FI_DEV_GPU_UTIL == 0, the node is in "idle" mode. In PromQL I can just write count(DCGM_FI_DEV_GPU_UTIL == 0) to get the number of "idle" nodes.

However, I do not understand how to write metricsQuery in the Prometheus Adapter config. All the examples I found look like this:

sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)

However, I need something like count(<<.Series>> == 0), and this does not work. Any idea how I can expose a metric to the HPA that indicates the number of nodes with no GPU consumption?

Answer 1

Score: 1

Your jobs are probably running in Kubernetes Pods. You may have a configuration where only one job Pod can run on a single Node. The first step is to configure your metrics for the Prometheus adapter; this is described quite nicely here. This step ensures that a Pod is added.
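
For the adapter step, one possible shape of the rule is sketched below. It is an untested assumption, not a confirmed fix: it exposes the PromQL from the question as an external metric, keeps the <<.LabelMatchers>> placeholder inside the selector braces, and applies the == 0 filter to that selector before counting. The metric name idle_gpu_count is hypothetical, and namespaced: false assumes the DCGM series are not tied to a namespace in your setup.

```yaml
# Sketch of a prometheus-adapter config fragment (untested; adjust to your labels).
externalRules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL'
  resources:
    # Assumption: the series is not tied to a Kubernetes namespace.
    namespaced: false
  name:
    matches: "DCGM_FI_DEV_GPU_UTIL"
    # Hypothetical name under which the HPA would see the metric.
    as: "idle_gpu_count"
  # Keep only series whose value is 0, then count them.
  metricsQuery: 'count(<<.Series>>{<<.LabelMatchers>>} == 0)'
```

Note that this query counts series, so it equals the number of idle nodes only when each node exposes a single GPU, and it returns no result at all (rather than 0) when no series matches the filter.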

In the second step, you need to configure a cluster autoscaler that will add another Node when needed. The cluster autoscaler depends on your Kubernetes provider (AWS, Azure, GCP...) and should be covered in their documentation. I personally use Cluster Autoscaler and Karpenter.

Answer 2

Score: 0

I ended up using KEDA with the Prometheus trigger. It is easy to use and supports PromQL queries. The only disadvantage is that it is an "average value" scaler, but that is not critical in my case.
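
As a rough illustration of where the PromQL query goes in such a setup, a ScaledObject might look like the sketch below. The answer does not show the actual manifest, so the object name, target workload, Prometheus address, replica bounds, and threshold are all placeholders, and how the idle count should map to a replica count is left open.

```yaml
# Sketch of a KEDA ScaledObject (untested); all names and addresses are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: idle-gpu-scaler              # hypothetical name
spec:
  scaleTargetRef:
    name: my-workload                # hypothetical Deployment to scale
  minReplicaCount: 1                 # placeholder bounds
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      # Assumed in-cluster Prometheus URL.
      serverAddress: http://prometheus.monitoring.svc:9090
      # PromQL taken from the question.
      query: 'count(DCGM_FI_DEV_GPU_UTIL == 0)'
      # Placeholder target value; KEDA expects a string here.
      threshold: "1"
```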
