Posted January 9, 2023, 01:31:50

In a Google Cloud Kubernetes cluster, my pods sometimes all restart. How do I find the reason for the restart?

Question

From time to time all my pods restart, and I'm not sure how to figure out why. Is there somewhere in Google Cloud where I can get that information, or a kubectl command I can run? It happens every couple of months or so, maybe less often.

Answer 1

Score: 3

It is also worth checking your cluster and node-pool operations.

List recent cluster operations by running this command in Cloud Shell:

gcloud container operations list

Check the age of the nodes with:

kubectl get nodes
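
If every node shows roughly the same young age, the nodes were probably recreated together (for example by an upgrade or auto-repair), which would restart every pod at once. As a rough sketch, the nodes can be listed oldest-first with their creation time and kubelet version. The function name list_nodes_by_age is only an illustrative helper, not a kubectl feature.

```shell
# Sketch: list nodes oldest-first with creation time and kubelet version.
# list_nodes_by_age is a hypothetical helper name chosen for this example.
list_nodes_by_age() {
  kubectl get nodes \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,VERSION:.status.nodeInfo.kubeletVersion
}

# Usage (needs access to the cluster):
#   list_nodes_by_age
```

Creation timestamps that cluster around the time of the restarts point to a node-level event rather than a problem inside the containers.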

Check how your deployments react to operations such as cluster upgrades, node-pool upgrades, and node-pool auto-repair. You can confirm whether a cluster or node-pool upgrade took place by running the queries below in Cloud Logging.

Note that you have to fill in your cluster and node-pool names in the empty cluster_name and nodepool_name fields.

Control plane (master) upgrade:

resource.type="gke_cluster"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateCluster" OR "UpdateClusterInternal")
(protoPayload.metadata.operationType="UPGRADE_MASTER"
  OR protoPayload.response.operationType="UPGRADE_MASTER")
resource.labels.cluster_name=""

Node-pool upgrade:

resource.type="gke_nodepool"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name=""
resource.labels.nodepool_name=""
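
The same kind of query can also be run from the command line with gcloud logging read. The sketch below is an assumption-laden example, not the answer's own method: find_node_upgrades is a hypothetical helper, the filter is a trimmed version of the node-pool query above, and the 90-day default window is chosen only because the restarts reportedly happen every couple of months (gcloud logging read otherwise looks back one day by default).

```shell
# Sketch: search Admin Activity audit logs for node-pool upgrades of one cluster.
# find_node_upgrades is a hypothetical helper; arguments: CLUSTER [FRESHNESS].
find_node_upgrades() (
  cluster="$1"
  freshness="${2:-90d}"   # assumed look-back window, adjust as needed

  # Same conditions as the node-pool query, joined with explicit AND.
  filter='resource.type="gke_nodepool"'
  filter="$filter AND log_id(\"cloudaudit.googleapis.com/activity\")"
  filter="$filter AND protoPayload.metadata.operationType=\"UPGRADE_NODES\""
  filter="$filter AND resource.labels.cluster_name=\"$cluster\""

  gcloud logging read "$filter" --freshness="$freshness" --limit=20 \
    --format='table(timestamp, resource.labels.nodepool_name, protoPayload.methodName)'
)

# Usage (cluster name is a placeholder):
#   find_node_upgrades my-cluster 120d
```

Comparing the timestamps in the output with the time the pods restarted shows whether an upgrade lines up with the restart.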

Answer 2

Score: 2

Use the following methods to find the reason for a pod restart:

Run kubectl describe deployment <deployment_name> and kubectl describe pod <pod_name>. The Events section at the end of the output records recent restarts and their causes, for example:

# Events:
#   Type     Reason   Age                 From               Message
#   ----     ------   ----                ----               -------
#   Warning  BackOff  40m                 kubelet, gke-xx    Back-off restarting failed container
# ..

Here the BackOff event shows that the kubelet is backing off from restarting a container that keeps failing (a crash loop, not an image pull problem), so the next step is to find out why the container exits.

Check the logs with: kubectl logs <pod_name>

To get the logs of the previous instance of a restarted container, add the --previous flag:

kubectl logs your_pod_name --previous

You can also have your application write a final message to /dev/termination-log; Kubernetes surfaces it in the container's last terminated state, as described in the docs.
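
The last terminated state can be read directly from the pod status. The following is a minimal sketch: last_termination is a hypothetical helper name, and the namespace defaults to "default" purely as an assumption for the example. It prints, per container, the name, restart count, and the reason, exit code, and finish time of the previous termination.

```shell
# Sketch: show why each container in a pod last terminated.
# last_termination is a hypothetical helper; arguments: POD [NAMESPACE].
last_termination() (
  pod="$1"
  ns="${2:-default}"   # assumed default namespace
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.finishedAt}{"\n"}{end}'
)

# Usage (pod name is a placeholder):
#   last_termination your_pod_name
```

A reason such as OOMKilled or Error points at the container itself. If the restart count is 0 and the pod simply has a new name and a recent age, the pod was most likely recreated because its node went away, which is a node-level cause rather than a container crash.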

Attaching a troubleshooting doc for reference.
