How to get a metric for a Spark pod that was OOMKilled, using Prometheus

Question

I have a Spark executor pod, and I want to raise an alert when it goes into the OOMKilled state. I am exporting Spark metrics to Grafana using Prometheus.

I have tried these queries:

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
kube_pod_container_status_terminated_reason{reason="OOMKilled"}

They don't seem to give correct results. I am cross-checking them against Humio logs, which record the OOMKilled events correctly.

I also tried:

container_memory_failures_total{pod="<<pod_name>>"}

Even this query does not capture the OOMKilled events that show up in the Humio logs. Is there another metric that reliably catches OOMKilled?

Answer 1

Score: 1

As far as I know, there are two metrics that let you monitor OOM.
The first one tracks the OOMKilled status of the container's main process (PID 1). If that process breaches the memory limit, the pod is restarted with this status.

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
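
This gauge only describes the container's previous termination. A container that was terminated and not restarted reports its reason through kube_pod_container_status_terminated_reason instead, so one possible sketch is to check both. The pod name pattern `.*-exec-.*` (meant to select Spark executor pods) and the grouping labels are assumptions for illustration, not something stated in the question; adjust them to your cluster.

```promql
# Sketch: containers whose current or previous termination reason is OOMKilled.
# The pod regex ".*-exec-.*" is an assumed naming pattern for Spark executors.
max by (namespace, pod, container) (
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~".*-exec-.*"} == 1
  or
    kube_pod_container_status_terminated_reason{reason="OOMKilled", pod=~".*-exec-.*"} == 1
)
```

The expression returns one series with value 1 per matching container, so it can be used directly as an alert condition. It only sees pods that still exist when kube-state-metrics is scraped.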

The second one counts the total number of OOM events inside the container. Every time a child process or any other process breaches the memory limit, it is simply killed and the counter is incremented, while the container keeps working as usual.

container_oom_events_total
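
Since this is a counter, an alert would usually look at its recent growth and not at its absolute value. A minimal sketch, assuming the metric carries `namespace`, `pod` and `container` labels in your setup, that Spark executor pods match `.*-exec-.*`, and that a 5-minute window fits your scrape interval (all three are assumptions):

```promql
# Sketch: containers that recorded at least one OOM event in the last 5 minutes.
# The label names, pod regex and 5m window are assumptions; adjust as needed.
sum by (namespace, pod, container) (
  increase(container_oom_events_total{pod=~".*-exec-.*"}[5m])
) > 0
```

Note that increase() needs at least two samples inside the window, so a container that disappears right after its first OOM event may not be counted.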

  • Posted on May 25, 2023 at 22:43:07
  • Source: https://go.coder-hub.com/76333545.html