GKE fails to mount volumes to deployment/pods: timeout waiting for the condition


Almost two years later, we are experiencing the same issue as described in this SO post.

Our workloads had been running without any disruption since 2018, and they suddenly stopped because we had to renew certificates. Since then we have not been able to start the workloads again. The failure is caused by the fact that pods try to mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) cannot mount the persistent disk.

We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and have hence also switched to containerd.

We have recreated everything multiple times with slight variations, but no luck. Pods definitely cannot access any persistent disk.

We've checked basic things such as: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disk, etc.

No logs are visible on each pod, which is also strange since logging seems to be correctly configured.
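When mounts time out like this, the Kubernetes events usually say more than the container logs, which never get written because the container never starts. A couple of standard kubectl commands (the `role=nfs-server` label and `webapp-data-disk` name come from the manifest below) can surface the actual attach/mount error:

```shell
# Events on the nfs-server pod often include the underlying
# attach/mount failure that never reaches the container logs.
kubectl describe pod -l role=nfs-server

# Cluster-wide events, most recent last; look for FailedAttachVolume
# or FailedMount entries referencing webapp-data-disk.
kubectl get events --sort-by=.metadata.creationTimestamp
```

These are read-only diagnostic commands and need a live cluster context, so they are safe to run against the affected namespace as-is.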

Here is the nfs-server.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}
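For context, the workload pods would reach this server through an NFS PersistentVolume along these lines. This is a hypothetical sketch: the Service and the `nfs-pv`/`nfs-pvc` names are assumptions, since the manifests for this part were not posted.

```yaml
# Assumed companion objects: a Service fronting the nfs-server pod,
# plus a PV/PVC pair that workload pods mount over NFS.
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  selector:
    role: nfs-server
  ports:
  - name: nfs
    port: 2049
  - name: mountd
    port: 20048
  - name: rpcbind
    port: 111
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    # Some setups cannot resolve the Service DNS name from the
    # in-tree NFS plugin; the Service's ClusterIP also works here.
    server: nfs-server.default.svc.cluster.local
    path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
```

Note that the NFS layer only matters once the nfs-server pod itself has mounted the gcePersistentDisk, which is the step failing here.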

Answer 1

Score: 3

OK, fixed. I had to enable the CSI driver on our legacy cluster, as described here...

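For reference, on GKE Standard the Compute Engine persistent disk CSI driver can be enabled on an existing cluster with a single update; the cluster name and zone below are placeholders:

```shell
# Enable the Compute Engine persistent disk CSI driver add-on on an
# existing (legacy) cluster. Once enabled, persistent disk volumes
# are handled by the pd.csi.storage.gke.io provisioner instead of
# the deprecated in-tree plugin.
gcloud container clusters update my-cluster \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED \
    --zone=us-central1-a
```

Clusters created on recent GKE versions have this driver enabled by default; older clusters that were upgraded in place, as described in the question, may still need it turned on explicitly.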

huangapple
  • Posted on 2023-02-08 12:47:27
  • Original link: https://go.coder-hub.com/75381474.html