GKE fails to mount volumes to deployment/pods: timeout waiting for the condition

Question

Almost two years later, we are experiencing the same issue as described in this SO post.

Our workloads had been running without any disruption since 2018, but they suddenly stopped because we had to renew certificates. Since then we have not been able to start the workloads again. The failure comes down to this: application pods try to mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) cannot mount the underlying GCE persistent disk.
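
For context, the setup follows the classic Kubernetes NFS example: application pods consume an NFS-backed PersistentVolume that points at a Service in front of the nfs-server pod, while the nfs-server pod itself mounts the GCE persistent disk. A minimal sketch of the client-side objects, assuming the usual names (the Service DNS name, size, and namespace are placeholders, not our exact manifests):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 10Gi               # assumed size
  accessModes:
    - ReadWriteMany
  nfs:
    # DNS name of the Service fronting the nfs-server pod (placeholder)
    server: nfs-server.default.svc.cluster.local
    path: "/"                   # volume-nfs:0.8 serves its /exports directory as the NFS root
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # bind to the statically created PV above, not a dynamic one
  resources:
    requests:
      storage: 10Gi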

We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and have therefore also switched to containerd.

We have recreated everything multiple times with slight variations, but no luck. Pods definitely cannot access any persistent disk.

We've checked the basics: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disks, etc.

No logs are visible for any of the pods, which is also strange since logging seems to be correctly configured.
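
Even with empty container logs, the kubelet normally records mount failures as pod events, and the disk can be inspected directly. Roughly (the zone is a placeholder):

kubectl describe pod -l role=nfs-server          # look for FailedAttachVolume / FailedMount events
kubectl get events --sort-by=.lastTimestamp
gcloud compute disks describe webapp-data-disk --zone <zone>   # confirm the disk exists and isn't attached elsewhere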

Here is the nfs-server.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true    # the NFS server container requires privileged mode
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:    # pre-existing GCE PD; this is the mount that times out
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}
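
For completeness, this deployment is normally paired with a Service so that client pods can reach the NFS ports; a sketch matching the ports above (the name and selector mirror the deployment, the rest is assumed):

apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  selector:
    role: nfs-server          # matches the pod labels in the deployment
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111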

Answer 1

Score: 3

OK, fixed. I had to enable the CSI driver on our legacy cluster, as described here...

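For anyone hitting the same wall: the driver in question is presumably the Compute Engine Persistent Disk CSI Driver. On recent GKE versions (including 1.25), in-tree gcePersistentDisk volumes are routed through this driver via CSI migration, so a legacy cluster that never had it enabled can no longer attach disks after upgrading. Enabling it on an existing cluster looks roughly like this (cluster name and zone are placeholders):

gcloud container clusters update CLUSTER_NAME \
    --zone ZONE \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED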
