How to get a metric for a Spark pod OOMKilled using Prometheus
Question
I have a Spark executor pod, and when it goes to OOMKilled status I want to alert on it. I am exporting Spark metrics to Grafana using Prometheus.
I have tried some queries:
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
kube_pod_container_status_terminated_reason{reason="OOMKilled"}
They don't seem to give proper results. I am cross-checking the results against Humio logs, which record the OOMKilled events properly.
container_memory_failures_total{pod="<<pod_name>>"}
Even this query is not able to capture the OOMKilled problems in sync with the Humio logs. Is there any other proper metric to catch OOMKilled?
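For reference, this is roughly the shape of the alerting rule I am trying to end up with; the group name, pod pattern and 10-minute window below are placeholders, not my actual configuration:

groups:
  - name: spark-oom
    rules:
      - alert: SparkExecutorOOMKilled
        # fire if OOMKilled was recorded as the last termination reason
        # for an executor container at any point in the last 10 minutes
        expr: |
          max_over_time(
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~"<<executor_pod_pattern>>"}[10m]
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Spark executor {{ $labels.pod }} was OOMKilled"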
Answer 1
Score: 1
As far as I know, there are two metrics which allow you to monitor OOM.
The first one is used for tracking the OOMKilled status of your main process/PID. If it breaches the limit, the pod will be restarted with this status.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
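Because this series reflects the last termination reason rather than a one-off event, one common way to turn it into an alert is to combine it with the restart counter, so the alert only fires when a restart has actually happened recently. A rough sketch, assuming kube-state-metrics is installed and using an arbitrary 10-minute window:

increase(kube_pod_container_status_restarts_total[10m]) > 0
  and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1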
The second one gathers the total count of OOM events inside the container. So every time a child process or some other process breaches the RAM limit, it is killed and the metric counter is increased, but the container keeps working as usual.
container_oom_events_total
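Since this is a counter, a recent OOM event can be detected by looking at its increase over a window; a minimal sketch, with a placeholder pod pattern and an arbitrary 5-minute range:

increase(container_oom_events_total{pod=~"<<executor_pod_pattern>>"}[5m]) > 0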
Comments