In a Google Cloud Kubernetes cluster, all my pods sometimes restart. How do I find the reason for the restart?


Question

From time to time all my pods restart, and I'm not sure how to figure out why. Is there somewhere in Google Cloud where I can get that information, or a kubectl command I can run? It happens every couple of months or so, maybe less often.

Answer 1

Score: 3

It is also worth checking your cluster and node-pool operations.

  1. List the cluster operations by running the following command in Cloud Shell:
gcloud container operations list
  2. Check the age of the nodes with the command:
kubectl get nodes
  3. Check how your deployment reacts to operations such as cluster upgrades, node-pool upgrades, and node-pool auto-repair. You can check Cloud Logging for cluster or node-pool upgrades using the queries below.

Please note that you have to add your cluster and node-pool names to the queries.
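Before turning to the log queries, the first two steps can be narrowed down a little. The sketch below is only an illustration: the function names are made up, and it assumes that UPGRADE_MASTER, UPGRADE_NODES and AUTO_REPAIR_NODES are the operation types of interest and that the listing starts with a NAME header row. If every node is roughly as old as the last mass restart, the nodes were most likely recreated and the pods rescheduled.

```shell
#!/bin/sh
# Hypothetical helper: keep the header row plus the operations that
# upgrade the control plane or recreate nodes (upgrades, auto-repair).
disruptive_operations() {
  gcloud container operations list \
    | grep -E '^NAME|UPGRADE_MASTER|UPGRADE_NODES|AUTO_REPAIR_NODES'
}

# Hypothetical helper: list nodes sorted by creation time (oldest first),
# so recently recreated nodes appear at the bottom of the output.
nodes_by_age() {
  kubectl get nodes --sort-by=.metadata.creationTimestamp
}

# Usage:
#   disruptive_operations
#   nodes_by_age
```

Comparing the START_TIME of a listed operation with the age of the nodes and pods is usually enough to tell whether an upgrade or repair caused the restart.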

Control plane (master) upgraded:

resource.type="gke_cluster"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateCluster" OR "UpdateClusterInternal")
(protoPayload.metadata.operationType="UPGRADE_MASTER"
  OR protoPayload.response.operationType="UPGRADE_MASTER")
resource.labels.cluster_name=""

Node-pool upgraded:

resource.type="gke_nodepool"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name=""
resource.labels.nodepool_name=""
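These queries can be pasted into the Logs Explorer, but they can also be run from a terminal with gcloud logging read. The following is a minimal sketch for the node-pool query, not a definitive script: the function names, the 90-day look-back window, the result limit and the output columns are all assumptions chosen for illustration, and the cluster and node-pool names are passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helper: print the node-pool upgrade filter for the given
# cluster ($1) and node pool ($2), with the conditions joined by AND.
nodepool_upgrade_filter() {
  np_cluster="$1"
  np_pool="$2"
  cat <<EOF
resource.type="gke_nodepool" AND
log_id("cloudaudit.googleapis.com/activity") AND
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal") AND
protoPayload.metadata.operationType="UPGRADE_NODES" AND
resource.labels.cluster_name="$np_cluster" AND
resource.labels.nodepool_name="$np_pool"
EOF
}

# Hypothetical helper: run the filter against Cloud Logging.
# $3 is an optional look-back window (assumed default: 90d).
find_nodepool_upgrades() {
  gcloud logging read "$(nodepool_upgrade_filter "$1" "$2")" \
    --freshness="${3:-90d}" \
    --limit=20 \
    --format='table(timestamp, protoPayload.methodName, resource.labels.nodepool_name)'
}

# Usage (placeholder names):
#   find_nodepool_upgrades my-cluster default-pool
```

Since the restarts reportedly happen only every couple of months, the look-back window matters: gcloud logging read only searches the last day unless --freshness is given.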

Answer 2

Score: 2

Use the following methods to check why a pod restarted:

Use kubectl describe deployment <deployment_name> and kubectl describe pod <pod_name>. The Events section of the output contains the relevant information.

# Events:
#   Type     Reason   Age                 From               Message
#   ----     ------   ----                ----               -------
#   Warning  BackOff  40m                 kubelet, gke-xx    Back-off restarting failed container
# ..

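The BackOff event with the message "Back-off restarting failed container" shows that a container in the pod keeps failing and the kubelet is restarting it with an increasing delay (CrashLoopBackOff). This is not an image pull problem, which would be reported as "Back-off pulling image". The failing container is the issue to troubleshoot.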
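To get the same kind of information for every pod at once, something like the sketch below may help. The function names and the chosen columns are illustrative assumptions; the fields themselves come from the pod status. Note that a pod recreated on a new node is a new object with a restart count of 0, so low restart counts combined with a young pod age point to rescheduling rather than to crashing containers.

```shell
#!/bin/sh
# Hypothetical helper: restart count, last termination reason and time
# for the containers of every pod in the cluster.
restart_summary() {
  kubectl get pods --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason,FINISHED:.status.containerStatuses[*].lastState.terminated.finishedAt'
}

# Hypothetical helper: cluster events, oldest first. Events are only
# kept for a short time (one hour by default), so run this soon after
# the restart.
recent_events() {
  kubectl get events --all-namespaces --sort-by=.lastTimestamp
}

# Usage:
#   restart_summary
#   recent_events
```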

Check the logs using: kubectl logs <pod_name>

To get the logs of the previous instance of your container (the one that was restarted), add the --previous flag, like this:

kubectl logs your_pod_name --previous
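For pods outside the default namespace or with more than one container, the command needs a namespace and a container name. A small wrapper along these lines could be used; the function name, the argument order and the 100-line tail are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: show the last lines logged by the previous
# instance of a container, with timestamps.
#   $1 = pod name, $2 = namespace (default: default), $3 = container (optional)
previous_logs() {
  pl_pod="$1"
  pl_namespace="${2:-default}"
  pl_container="${3:-}"
  if [ -n "$pl_container" ]; then
    kubectl logs "$pl_pod" -n "$pl_namespace" -c "$pl_container" \
      --previous --timestamps --tail=100
  else
    kubectl logs "$pl_pod" -n "$pl_namespace" --previous --timestamps --tail=100
  fi
}

# Usage (placeholder names):
#   previous_logs your_pod_name
#   previous_logs your_pod_name production app
```

The timestamps make it easier to line up the final log lines with the time of the restart.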

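You can also have the container write a final message to /dev/termination-log before it exits. Kubernetes stores that message in the container's last terminated state, where it can be read back as described in the docs.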
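As a rough illustration, a container entrypoint could wrap the real command and record why it stopped. Everything named here is hypothetical: the wrapper function, the TERMINATION_LOG variable (which only exists so the path can be overridden when trying the script outside a container) and the my_server command in the usage comment.

```shell
#!/bin/sh
# Path of the termination message file; /dev/termination-log is the
# Kubernetes default location.
TERMINATION_LOG="${TERMINATION_LOG:-/dev/termination-log}"

# Hypothetical wrapper: run the given command and, if it fails, write a
# short final message before returning the same exit status.
run_with_termination_message() {
  rt_status=0
  "$@" || rt_status=$?
  if [ "$rt_status" -ne 0 ]; then
    printf 'command "%s" exited with status %s at %s\n' \
      "$*" "$rt_status" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$TERMINATION_LOG"
  fi
  return "$rt_status"
}

# Usage inside the container (my_server is a placeholder):
#   run_with_termination_message my_server --port 8080
#
# Reading the message back after the container has restarted:
#   kubectl get pod <pod_name> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.message}'
```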

Attaching a troubleshooting doc for reference.

huangapple
  • Posted on January 9, 2023 at 01:31:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75049951.html