Kubernetes KEDA scaledjob is not responding
Question
We are using an Azure DevOps agent configured in an AKS cluster with KEDA ScaledJobs. The AKS node pool SKU is Standard_E8ds_v5 (1 instance), and we are using a persistent volume mounted on an Azure disk.

The ScaledJob spec is as below:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
  name: azdevops-scaledjob
  namespace: ado
spec:
  failedJobsHistoryLimit: 5
  jobTargetRef:
    template:
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: kubernetes.azure.com/mode
                  operator: In
                  values:
                  - mypool
                - key: topology.disk.csi.azure.com/zone
                  operator: In
                  values:
                  - westeurope-1
              weight: 2
        containers:
        - env:
          - name: AZP_URL
            value: https://azuredevops.xxxxxxxx/xxxxxxx/organisation
          - name: AZP_TOKEN
            value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          - name: AZP_POOL
            value: az-pool
          image: xxxxxxxxxxxxxx.azurecr.io/vsts/dockeragent:xxxxxxxxx
          imagePullPolicy: Always
          name: azdevops-agent-job
          resources:
            limits:
              cpu: 1500m
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
          volumeMounts:
          - mountPath: /mnt
            name: ado-cache-storage
        volumes:
        - name: ado-cache-storage
          persistentVolumeClaim:
            claimName: azure-disk-pvc
  maxReplicaCount: 8
  minReplicaCount: 1
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  triggers:
  - metadata:
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      poolID: "xxxx"
    type: azure-pipelines
But we noticed a strange behavior: when trying to trigger a build, the pipeline shows this error message:

"We stopped hearing from agent azdevops-scaledjob-xxxxxxx. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error."

The pipeline hangs and keeps running without reporting an error, but in the backend the pod is already in an error state. So every time this occurs we have to cancel the pipeline and initiate a new build so that it is scheduled onto an available pod.
On describing the pod that is in the error state, we could identify this:
azdevops-scaledjob-6xxxxxxxx-b 0/1 Error 0 27h
The pod reports the error below:
Annotations: <none>
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container azdevops-agent-job was using 23001896Ki, which exceeds its request of 0.
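The "request of 0" in that eviction message refers to the container's ephemeral-storage request, which the ScaledJob above never declares. A minimal sketch of what declaring one could look like in the agent container's resources block (the ephemeral-storage sizes are illustrative assumptions, not values from the original spec):

resources:
  requests:
    cpu: 500m
    memory: 3Gi
    ephemeral-storage: 2Gi     # assumed size; lets the scheduler account for the agent's disk usage
  limits:
    cpu: 1500m
    memory: 6Gi
    ephemeral-storage: 20Gi    # assumed size; the kubelet evicts just this pod if it exceeds the limit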
Answer 1
Score: 1
I have set safe-to-evict to false, so AKS won't relocate the pod/job because of a node downscale.

The drawback here is that AKS can end up keeping more nodes than needed, so you must make sure the pod/job won't be there forever.
spec:
  jobTargetRef:
    template:
      metadata:
        annotations:
          "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Another possibility is to change the node downscale timeout.

Terraform code:
auto_scaler_profile {
  scale_down_unneeded = "90m"
}
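For reference, a minimal sketch of where this block sits in the azurerm_kubernetes_cluster resource, assuming the azurerm 3.x provider (the resource names, node counts and autoscaler bounds below are illustrative, not taken from the original setup):

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-ado-agents"                    # assumed cluster name
  location            = azurerm_resource_group.rg.location  # assumed resource group reference
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = "aks-ado-agents"

  default_node_pool {
    name                = "mypool"
    vm_size             = "Standard_E8ds_v5"
    enable_auto_scaling = true  # the profile below only matters when the cluster autoscaler is on
    min_count           = 1
    max_count           = 3     # assumed upper bound
  }

  identity {
    type = "SystemAssigned"
  }

  auto_scaler_profile {
    # keep unneeded nodes for 90 minutes before the autoscaler removes them
    scale_down_unneeded = "90m"
  }
}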