2020-08-02 18:04:31

Create JVM heap dump when K8s health check restarts the pod - no OOM occurs

Question

I have a situation where a really long GC pause occurs all of a sudden, and I need to find the source of the sudden memory allocation. The long GC pause (around 30 seconds) causes the pod to fail several K8s health checks in a row, and the pod gets restarted without an OOM actually happening. I want to create a heap dump before K8s actually restarts the pod. I realise that the dump should be written to some external persistent mount.

The only idea I have for triggering the heap dump is to use the preStop hook. The question is whether the preStop hook is fired when the pod is restarted because of a health check failure.

Maybe there is a more elegant solution to this?

Answer 1

Score: 3

> The question is, whether the preStop hook is fired when the pod is
> restarted because of health check failure?

Yes. By definition, the preStop hook runs immediately before a container is terminated due to an API request or a management event such as a liveness probe failure, preemption, or resource contention.

> Should I use preStop hook to capture Java Heap Dump before pod
> termination?

Yes, but be careful: a call to the preStop hook fails if the container is already in a terminated or completed state. When the pod is terminated, Kubernetes waits for the grace period (30 seconds by default, plus a one-off extension of 2 seconds if the preStop hook has not completed) before sending the KILL signal. If the preStop hook needs longer than the default grace period allows, you must increase terminationGracePeriodSeconds to suit.
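As a rough sketch of what such a hook could look like, the pod spec below triggers a dump with `jcmd` from the preStop hook and raises the grace period. The pod name, image, the 120-second value, and the `/dumps` path are all hypothetical, and it assumes the image ships a JDK (so `jcmd` is available), that the JVM runs as PID 1 in the container, and that a volume is mounted at `/dumps` (not shown here).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: heapdump-demo                  # hypothetical pod name
spec:
  # Assumed value: must be long enough for the heap dump to finish.
  terminationGracePeriodSeconds: 120
  containers:
    - name: app
      image: example/yourapp:latest    # hypothetical image that includes a JDK
      lifecycle:
        preStop:
          exec:
            # Assumes the JVM is PID 1 and /dumps is a mounted volume.
            # The timestamp keeps a later dump from colliding with an existing file.
            command: ["sh", "-c", "jcmd 1 GC.heap_dump /dumps/prestop-$(date +%s).hprof"]
```

Whether the dump completes in time depends on heap size and on how responsive the JVM is at that moment, so the grace period here is a guess to be tuned.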

> Any more elegant solution to this?

Not that I am aware of. I guess that adding an emptyDir volume to the pod and configuring the JVM to write heap dumps to that directory with `command: ["java", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/dumps/oom.bin", "-jar", "yourapp.jar"]` should work.

> Why will the above solution work?

When Kubernetes kills your container because it is not responding to the health check, it just restarts the container; it does not reschedule the pod, so the pod stays on the same node. An emptyDir volume lives for as long as the pod remains on that node, so it is not deleted by a container restart. When the container is restarted, the new container mounts the same emptyDir, which still contains the heap dump from the previous run. You can therefore kubectl cp those files at any time after the event. There might be other challenges in copying the heap dump files, but they are solvable. Check this for more info.
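The emptyDir setup described above could be sketched roughly as follows. The pod name, image, and volume name are hypothetical; only the `command` line and the `/dumps` path come from the answer.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: heapdump-demo                  # hypothetical pod name
spec:
  containers:
    - name: app
      image: example/yourapp:latest    # hypothetical image
      command: ["java", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/dumps/oom.bin", "-jar", "yourapp.jar"]
      volumeMounts:
        - name: dumps                  # hypothetical volume name
          mountPath: /dumps
  volumes:
    - name: dumps
      # Survives container restarts; removed when the pod leaves the node.
      emptyDir: {}
```

With these assumed names, the file could later be fetched with something like `kubectl cp heapdump-demo:/dumps/oom.bin ./oom.bin`, which relies on `tar` being present in the container. Note that `-XX:+HeapDumpOnOutOfMemoryError` only writes a dump when the JVM throws an OutOfMemoryError, so in the no-OOM case from the question the dump itself would still have to be triggered some other way, for example by a preStop hook writing to the same directory.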
