March 3, 2023, 18:55:47

Why is the startup probe ignored?

Question

I deployed WAS to Kubernetes (version 1.16) and used all three types of probes.

The liveness probe checks that the WAS process is running and that all open ports are listening. The readiness probe calls the WAS health check API via HTTP GET. The startup probe uses the same logic as the liveness probe, but has an additional task: initializing the health check API. This means that if the startup probe is never executed, the health check API is never enabled, and the readiness probe will always fail.

My guess is that:

If the startup probe fails repeatedly beyond the threshold, the container will be restarted.
If the startup probe runs normally, the readiness probe should not fail. (Note that the startup probe also fails if enabling the health check API fails, so there is no case where the readiness probe fails even though the startup probe succeeded.)

In conclusion, the readiness probe should never fail, because the only possible outcomes are that the container is restarted or the readiness probe succeeds. In addition, the startup probe has a failureThreshold of 36 and a periodSeconds of 5, so the liveness and readiness probes should be held back for up to 180 seconds (36 × 5) while the startup probe is still failing. However, there are cases where the readiness probe fails before 3 minutes have passed.

This leads me to believe that the startup probe is being bypassed and the liveness/readiness probes are being executed anyway.

According to the Kubernetes docs, the startup probe exists to ensure that the liveness and readiness probes start at the right time. The problem is that if this probe is ignored, I have to fall back on a fixed delay via initialDelaySeconds, which is not as reliable as waiting for the startup probe to succeed.

First of all, I'm wondering whether the problem I suspect is actually happening. I also don't know how to verify it. In the k8s events I could only see readiness probe failure events, not the success or failure of the startup probe. Maybe I have misunderstood how the startup probe works. I hope someone can provide a proper solution.

Below is the probe configuration I wrote.

livenessProbe:
  exec:
    command:
    - liveness 
  initialDelaySeconds: 10
readinessProbe:
  exec:
    command:
    - readiness
  initialDelaySeconds: 10
startupProbe:
  exec:
    command:
    - liveness
    - -startup
  failureThreshold: 36
  periodSeconds: 5

So far I have tried analyzing logs with commands like kubectl describe and kubectl logs, and checking k8s events.

Answer 1

Score: 2

The startup probe is not enabled by default in version 1.16; see https://v1-22.docs.kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/. To use this probe, you have to enable the feature gate.
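
As a rough sketch of what enabling the gate could look like on a cluster you administer yourself: the gate is named StartupProbe in the feature gates reference linked above, and one place to set it is the kubelet configuration file. How that file reaches your nodes is an assumption about your setup, and managed clusters may not let you change it at all.

```yaml
# Fragment of a kubelet configuration file; only featureGates matters here.
# Assumption: your nodes read their kubelet settings from a file like this.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  StartupProbe: true   # alpha in 1.16, so it stays off unless enabled explicitly
```

The same gate also has to be on for kube-apiserver (for example --feature-gates=StartupProbe=true). As far as I know, with the gate off the API server drops the startupProbe field when the pod is created, which would explain both symptoms in the question: no startup probe events, and the readiness probe firing after its own initialDelaySeconds of 10. Checking whether startupProbe still appears in the output of kubectl get pod <pod-name> -o yaml is a quick way to confirm. The gate became beta and on by default in 1.18, so upgrading is the other option; if neither is possible, a larger initialDelaySeconds on the liveness and readiness probes is the usual fallback.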
