Trouble with Prometheus metrics (Adapter and metricsQuery)
Question
Original problem: I would like to have a Kubernetes cluster that always has at least 2 nodes with zero GPU consumption. If a job arrives and takes one node, the autoscaler should create another spare node.
I found out that I can rely on the DCGM_FI_DEV_GPU_UTIL metric. If DCGM_FI_DEV_GPU_UTIL == 0, the node is in "idle" mode. In PromQL I can just write count(DCGM_FI_DEV_GPU_UTIL == 0) and get the number of "idle" nodes.
However, I do not understand how to write metricsQuery in the Prometheus Adapter config. All the examples I found look like
sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)
whereas I need something like count(<<.Series>> == 0), which does not work. Any idea how I can expose to the HPA a metric that indicates the number of nodes with no GPU consumption?
Answer 1
Score: 1
Your jobs are probably running in Kubernetes Pods, and you may have a configuration where only one job Pod can run on a single Node. The first step is to configure your metrics for the Prometheus adapter, which is described quite nicely here. This step ensures that a Pod is added.
In the second step, you need to configure a cluster autoscaler that adds another Node when needed. The cluster autoscaler depends on your Kubernetes provider (AWS, Azure, GCP...) and should be covered in their documentation. I personally use Cluster Autoscaler and Karpenter.
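For the first step, a rule along the following lines might be a starting point. This is only a sketch, not a tested configuration: it assumes the idle count is exposed as an external metric (it describes the whole cluster rather than a single Pod), the metric name idle_gpu_count is made up for illustration, and namespaced: false assumes an adapter version that supports that field. The "or vector(0)" part is also an addition, so that the query still returns a value when no GPU is idle.

```yaml
# Prometheus Adapter config fragment (sketch under the assumptions stated above)
externalRules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL'
    resources:
      # Assumption: do not restrict the query to the HPA's namespace.
      namespaced: false
    name:
      # Hypothetical name under which the HPA would request the metric.
      as: "idle_gpu_count"
    # Keep the <<.LabelMatchers>> placeholder so the adapter can inject
    # any label selectors; compare to 0 before counting.
    metricsQuery: 'count(<<.Series>>{<<.LabelMatchers>>} == 0) or vector(0)'
```

Note that DCGM_FI_DEV_GPU_UTIL is reported per GPU, so on nodes with several GPUs this counts idle GPUs rather than idle nodes.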
Answer 2
Score: 0
I ended up using KEDA with the prometheus trigger. It is easy to use and supports PromQL queries. The only disadvantage is that it is an "average value" scaler, but that is not critical in my case.
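For reference, a KEDA ScaledObject with a prometheus trigger has roughly the shape below. This is a sketch only: the object name, the target Deployment, the Prometheus address, the replica bounds, and the threshold are all placeholders, not values taken from the setup described above. The query is the one from the question.

```yaml
# KEDA ScaledObject fragment (all names and numbers are illustrative)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-idle-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: gpu-worker               # hypothetical Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        # Assumed in-cluster Prometheus address.
        serverAddress: http://prometheus.monitoring.svc:9090
        # Any PromQL expression returning a single value can go here.
        query: count(DCGM_FI_DEV_GPU_UTIL == 0) or vector(0)
        # Target average value per replica (placeholder).
        threshold: "2"
```

The "average value" behaviour means the resulting HPA aims for roughly ceil(query result / threshold) replicas, so the query and threshold have to be chosen so that this ratio moves in the direction you want to scale.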