GKE fails to mount volumes to deployment/pods: timeout waiting for the condition

Question

Almost two years later, we are experiencing the same issue as described in this SO post.

Our workloads had been running without any disruption since 2018, but they suddenly stopped because we had to renew certificates. Since then we have not been able to start the workloads again. The failure comes down to this: application pods try to mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) cannot mount the underlying GCE persistent disk.
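
For context, the setup follows the classic Kubernetes NFS example: application pods consume an NFS-backed PersistentVolume that points at a Service in front of the nfs-server pod, while the nfs-server pod itself mounts the GCE persistent disk. A minimal sketch of the client-side objects, assuming the usual names (the Service DNS name, size, and namespace are placeholders, not our exact manifests):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 10Gi               # assumed size
  accessModes:
    - ReadWriteMany
  nfs:
    # DNS name of the Service fronting the nfs-server pod (placeholder)
    server: nfs-server.default.svc.cluster.local
    path: "/"                   # volume-nfs:0.8 serves its /exports directory as the NFS root
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # bind to the statically created PV above, not a dynamic one
  resources:
    requests:
      storage: 10Gi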

We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and have therefore also switched to containerd.

We have recreated everything multiple times with slight variations, but no luck. Pods definitely cannot access any persistent disk.

We've checked the basics: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disks, etc.

No logs are visible for any of the pods, which is also strange since logging seems to be correctly configured.
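
Even with empty container logs, the kubelet normally records mount failures as pod events, and the disk can be inspected directly. Roughly (the zone is a placeholder):

kubectl describe pod -l role=nfs-server          # look for FailedAttachVolume / FailedMount events
kubectl get events --sort-by=.lastTimestamp
gcloud compute disks describe webapp-data-disk --zone <zone>   # confirm the disk exists and isn't attached elsewhere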

Here is the nfs-server.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true    # the NFS server container requires privileged mode
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:    # pre-existing GCE PD; this is the mount that times out
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}
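
For completeness, this deployment is normally paired with a Service so that client pods can reach the NFS ports; a sketch matching the ports above (the name and selector mirror the deployment, the rest is assumed):

apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  selector:
    role: nfs-server          # matches the pod labels in the deployment
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111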

Answer 1

Score: 3

OK, fixed. I had to enable the CSI driver on our legacy cluster, as described here...

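For anyone hitting the same wall: the driver in question is presumably the Compute Engine Persistent Disk CSI Driver. On recent GKE versions (including 1.25), in-tree gcePersistentDisk volumes are routed through this driver via CSI migration, so a legacy cluster that never had it enabled can no longer attach disks after upgrading. Enabling it on an existing cluster looks roughly like this (cluster name and zone are placeholders):

gcloud container clusters update CLUSTER_NAME \
    --zone ZONE \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED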
