GKE fails to mount volumes to deployment/pods: timeout waiting for the condition
Question
Almost two years later, we are experiencing the same issue as described in this SO post.
Our workloads had been working without any disruption since 2018, and they suddenly stopped because we had to renew certificates. Then we've not been able to start the workloads again... The failure is caused by the fact that pods try to mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) can't mount the persistent disk.
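For context, the client side of this setup points a PersistentVolume at the nfs-server and the workloads claim it; the manifest below is only a rough sketch of that pattern, not our actual manifests (the Service name, namespace, and sizes are assumptions):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv                     # hypothetical name, for illustration only
spec:
  capacity:
    storage: 10Gi                  # assumed size
  accessModes:
    - ReadWriteMany
  nfs:
    # assumes a ClusterIP Service named "nfs-server" in front of the pod above;
    # the node performs this mount, so on some clusters the Service's ClusterIP
    # has to be used here instead of the cluster DNS name
    server: nfs-server.default.svc.cluster.local
    path: "/exports"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc                    # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # bind to the pre-provisioned PV above
  resources:
    requests:
      storage: 10Gi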
We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and have therefore also switched to containerd.
We have recreated everything multiple times with slight variations, but no luck. Pods definitely cannot access any persistent disk.
We've checked basic things such as: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disk, etc.
No logs are visible on each pod, which is also strange since logging seems to be correctly configured.
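Mount failures of this kind usually surface as pod events rather than container logs, so checking events may reveal more than kubectl logs; something along these lines (pod name and namespace are placeholders):

# Mount timeouts usually appear in the pod's Events section, not in container logs
kubectl describe pod nfs-server-xxxxx --namespace default
kubectl get events --namespace default --sort-by=.lastTimestamp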
Here is the nfs-server.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}
Answer 1
Score: 3
OK, fixed. I had to enable the CSI driver on our legacy cluster, as described here...
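For anyone landing here: on an existing cluster the Compute Engine persistent disk CSI driver add-on can be enabled with something like the following (cluster name and zone are placeholders):

# Enable the GCE persistent disk CSI driver add-on on an existing GKE cluster
gcloud container clusters update CLUSTER_NAME \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED \
    --zone ZONE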