Question

In a Google Cloud Kubernetes cluster, my pods sometimes all restart. How do I find the reason for the restarts?

From time to time all my pods restart and I'm not sure how to figure out why it's happening. Is there someplace in Google Cloud where I can get that information, or a kubectl command to run? It happens every couple of months or so, maybe less frequently than that.
Answer 1
Score: 3
It's also a good idea to check your cluster and node-pool operations.
- Check the cluster operations in Cloud Shell by running:
gcloud container operations list
- Check the age of the nodes with:
kubectl get nodes
- Check and analyze how your deployment reacts to operations such as cluster upgrades, node-pool upgrades, and node-pool auto-repair. You can search Cloud Logging for your cluster or node-pool upgrades using the queries below.
Please note that you have to add your cluster and node-pool names to the queries.
Control plane (master) upgraded:
resource.type="gke_cluster"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateCluster" OR "UpdateClusterInternal")
(protoPayload.metadata.operationType="UPGRADE_MASTER"
OR protoPayload.response.operationType="UPGRADE_MASTER")
resource.labels.cluster_name=""
Node-pool upgraded:
resource.type="gke_nodepool"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name=""
resource.labels.nodepool_name=""
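As a small convenience sketch (a hypothetical helper, not part of gcloud), the node-pool query above can be assembled programmatically so the cluster and node-pool names are filled in before pasting it into the Cloud Logging console or `gcloud logging read`:

```python
def nodepool_upgrade_filter(cluster: str, nodepool: str) -> str:
    """Build the Cloud Logging filter for node-pool upgrade audit logs,
    mirroring the query shown above with the names substituted in."""
    return "\n".join([
        'resource.type="gke_nodepool"',
        'log_id("cloudaudit.googleapis.com/activity")',
        'protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")',
        'protoPayload.metadata.operationType="UPGRADE_NODES"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.nodepool_name="{nodepool}"',
    ])

print(nodepool_upgrade_filter("my-cluster", "default-pool"))
```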
Answer 2
Score: 2
Use the methods below to check the reason for a pod restart:
Use kubectl describe deployment <deployment_name>
and kubectl describe pod <pod_name>,
which contain this information.
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning BackOff 40m kubelet, gke-xx Back-off restarting failed container
# ..
You can see that the pod was restarted because the kubelet is backing off restarting a failed container (a crash loop). That is the particular issue we need to troubleshoot.
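Restart reasons can also be read programmatically from the pod's status. A sketch (the helper name is ours), assuming the JSON comes from kubectl get pod <pod_name> -o json:

```python
import json

def last_restart_reasons(pod_json: str) -> dict:
    """Map container name -> reason for its most recent termination,
    taken from status.containerStatuses[].lastState.terminated."""
    pod = json.loads(pod_json)
    reasons = {}
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            reasons[cs["name"]] = terminated.get("reason", "Unknown")
    return reasons

# Example status shaped like kubectl's JSON output for a restarted pod:
sample = json.dumps({
    "status": {"containerStatuses": [
        {"name": "app",
         "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]}
})
print(last_restart_reasons(sample))  # → {'app': 'OOMKilled'}
```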
Check the logs using: kubectl logs <pod_name>
To get the previous logs of your container (the restarted one), use the --previous flag, like this:
kubectl logs your_pod_name --previous
You can also write a final message to /dev/termination-log, and it will show up as described in the docs.
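A minimal sketch of the termination-log idea, assuming a Python container (the helper name is ours): write the failure reason just before exiting, and it will appear under Last State in kubectl describe pod output.

```python
# Kubernetes reads a container's termination message from this path by
# default; it is configurable via terminationMessagePath in the pod spec.
TERMINATION_LOG = "/dev/termination-log"

def write_termination_message(message: str, path: str = TERMINATION_LOG) -> None:
    """Write a short final message that `kubectl describe pod` will show
    for the container after a restart."""
    try:
        with open(path, "w") as f:
            f.write(message[:4096])  # the kubelet caps the message size
    except OSError:
        pass  # outside a pod, the default path may not exist or be writable
```

For example, an app could call write_termination_message("fatal: config missing") in its top-level exception handler before exiting.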
Attaching a troubleshooting doc for reference.