Kubernetes KEDA scaledjob is not responding.

We are using Azure DevOps agents configured in an AKS cluster with KEDA ScaledJobs. The AKS node pool SKU is Standard_E8ds_v5 (1 instance), and we are using a persistent volume mounted on an Azure disk.

The ScaledJob spec is as below:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
  name: azdevops-scaledjob
  namespace: ado
spec:
  failedJobsHistoryLimit: 5
  jobTargetRef:
    template:
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: kubernetes.azure.com/mode
                  operator: In
                  values:
                  - mypool
                - key: topology.disk.csi.azure.com/zone
                  operator: In
                  values:
                  - westeurope-1
              weight: 2
        containers:
        - env:
          - name: AZP_URL
            value: https://azuredevops.xxxxxxxx/xxxxxxx/organisation
          - name: AZP_TOKEN
            value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          - name: AZP_POOL
            value: az-pool
          image: xxxxxxxxxxxxxx.azurecr.io/vsts/dockeragent:xxxxxxxxx
          imagePullPolicy: Always
          name: azdevops-agent-job
          resources:
            limits:
              cpu: 1500m
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
          volumeMounts:
          - mountPath: /mnt
            name: ado-cache-storage
        volumes:
        - name: ado-cache-storage
          persistentVolumeClaim:
            claimName: azure-disk-pvc
  maxReplicaCount: 8
  minReplicaCount: 1
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  triggers:
  - metadata:
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      poolID: "xxxx"
    type: azure-pipelines

But we noticed a strange behavior: when trying to trigger a build, this error message appears in the pipeline:

"We stopped hearing from agent azdevops-scaledjob-xxxxxxx. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error".

The pipeline hangs and keeps running without reporting an error, but in the backend the pod is already in an error state. So each time this occurs we have to cancel the pipeline and initiate a new build so that it gets scheduled to an available pod.

The pod in the error state shows up as:

azdevops-scaledjob-6xxxxxxxx-b   0/1     Error     0          27h

Describing the pod shows the following error:

Annotations:  <none>
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: ephemeral-storage. Container azdevops-agent-job was using 23001896Ki, which exceeds its request of 0.
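
The eviction message says the container was using roughly 23 GiB of ephemeral storage against a request of 0, so the kubelet picked it first when the node's disk ran low. One option (independent of the accepted answer below) is to declare ephemeral-storage requests and limits on the agent container, so the scheduler accounts for the disk the builds consume. This is a sketch; the 10Gi/30Gi values are illustrative assumptions and should be sized to your actual build workloads:

```yaml
# Sketch: add ephemeral-storage to the agent container's resources block.
# The 10Gi request / 30Gi limit values are assumptions - measure your builds.
resources:
  requests:
    cpu: 500m
    memory: 3Gi
    ephemeral-storage: 10Gi   # scheduler reserves this much node disk
  limits:
    cpu: 1500m
    memory: 6Gi
    ephemeral-storage: 30Gi   # pod is evicted/killed if usage exceeds this
```

With a non-zero request, the pod is no longer the cheapest eviction victim; with a limit, a runaway build fails visibly instead of starving the whole node.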

Answer 1

Score: 1

I have set safe-to-evict to false, so AKS won't relocate the pod/job because of a node downscale.

The drawback is that AKS may keep more nodes than needed, so you must ensure the pod/job doesn't stay around forever.

spec:
  jobTargetRef:
    template:
      metadata:
        annotations:
          "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

Another possibility is to change the node scale-down timeout.

Terraform code:

  auto_scaler_profile {
    scale_down_unneeded = "90m"
  }
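
For context, the `auto_scaler_profile` block above sits inside the `azurerm_kubernetes_cluster` resource in the azurerm Terraform provider. A minimal sketch, where the resource, name, and location values are placeholder assumptions, not from the original post:

```hcl
# Sketch: placement of auto_scaler_profile in the azurerm provider.
# All names/locations below are illustrative assumptions.
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "my-aks"      # assumption
  location            = "westeurope"
  resource_group_name = "my-rg"       # assumption
  dns_prefix          = "myaks"

  default_node_pool {
    name       = "mypool"
    vm_size    = "Standard_E8ds_v5"
    node_count = 1
  }

  identity {
    type = "SystemAssigned"
  }

  auto_scaler_profile {
    # Keep unneeded nodes for 90 minutes before scaling them down,
    # giving long-running agent jobs time to finish.
    scale_down_unneeded = "90m"
  }
}
```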

huangapple
  • Posted on 2023-05-11 02:37:04
  • Please keep this link when reposting: https://go.coder-hub.com/76221642.html