How to get a metric for a Spark pod OOMKilled using Prometheus
Question
I have a Spark executor pod, and when it goes to OOMKilled status I want to alert on it. I am exporting Spark metrics to Grafana using Prometheus.
I have tried some queries:
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
kube_pod_container_status_terminated_reason{reason="OOMKilled"}
They don't seem to give proper results. I am cross-checking the results against Humio logs, which record the OOMKilled events properly.
container_memory_failures_total{pod="<<pod_name>>"}
Even this query is not able to capture the OOMKilled problems in sync with the Humio logs. Is there any other proper metric to catch OOMKilled?
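For reference, this is roughly the shape of the alerting rule I am trying to end up with; the group name, pod pattern and 10-minute window below are placeholders, not my actual configuration:

groups:
  - name: spark-oom
    rules:
      - alert: SparkExecutorOOMKilled
        # fire if OOMKilled was recorded as the last termination reason
        # for an executor container at any point in the last 10 minutes
        expr: |
          max_over_time(
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~"<<executor_pod_pattern>>"}[10m]
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Spark executor {{ $labels.pod }} was OOMKilled"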
Answer 1
Score: 1
As far as I know, there are two metrics which allow you to monitor OOM.
The first one is used for tracking the OOMKilled status of your main process/PID. If it breaches the limit, the pod will be restarted with this status.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
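Because this series reflects the last termination reason rather than a one-off event, one common way to turn it into an alert is to combine it with the restart counter, so the alert only fires when a restart has actually happened recently. A rough sketch, assuming kube-state-metrics is installed and using an arbitrary 10-minute window:

increase(kube_pod_container_status_restarts_total[10m]) > 0
  and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1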
The second one gathers the total count of OOM events inside the container. So every time a child process or some other process breaches the RAM limit, it is killed and the metric counter is increased, but the container keeps working as usual.
container_oom_events_total
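Since this is a counter, a recent OOM event can be detected by looking at its increase over a window; a minimal sketch, with a placeholder pod pattern and an arbitrary 5-minute range:

increase(container_oom_events_total{pod=~"<<executor_pod_pattern>>"}[5m]) > 0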
Comments