Kubernetes KEDA scaledjob is not responding
Question
We are using an Azure DevOps agent configured in an AKS cluster with KEDA ScaledJobs. The AKS node pool SKU is Standard_E8ds_v5 (1 instance), and we are using a persistent volume mounted on an Azure disk.

The ScaledJob spec is as below:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  annotations:
  name: azdevops-scaledjob
  namespace: ado
spec:
  failedJobsHistoryLimit: 5
  jobTargetRef:
    template:
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                - key: kubernetes.azure.com/mode
                  operator: In
                  values:
                  - mypool
                - key: topology.disk.csi.azure.com/zone
                  operator: In
                  values:
                  - westeurope-1
              weight: 2
        containers:
        - env:
          - name: AZP_URL
            value: https://azuredevops.xxxxxxxx/xxxxxxx/organisation
          - name: AZP_TOKEN
            value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          - name: AZP_POOL
            value: az-pool
          image: xxxxxxxxxxxxxx.azurecr.io/vsts/dockeragent:xxxxxxxxx
          imagePullPolicy: Always
          name: azdevops-agent-job
          resources:
            limits:
              cpu: 1500m
              memory: 6Gi
            requests:
              cpu: 500m
              memory: 3Gi
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
          volumeMounts:
          - mountPath: /mnt
            name: ado-cache-storage
        volumes:
        - name: ado-cache-storage
          persistentVolumeClaim:
            claimName: azure-disk-pvc
  maxReplicaCount: 8
  minReplicaCount: 1
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  triggers:
  - metadata:
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      poolID: "xxxx"
    type: azure-pipelines
But we noticed a strange behavior: when trying to trigger a build, the pipeline shows this error message:

"We stopped hearing from agent azdevops-scaledjob-xxxxxxx. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error."

The pipeline hangs and keeps running without reporting an error, but in the backend the pod is already in an error state. So every time this occurs we have to cancel the pipeline and initiate a new build so that it is scheduled onto an available pod.
On describing the pod that is in the error state, we could identify this:
azdevops-scaledjob-6xxxxxxxx-b 0/1 Error 0 27h
The pod reports the error below:
Annotations: <none>
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container azdevops-agent-job was using 23001896Ki, which exceeds its request of 0.
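The "request of 0" in that eviction message refers to the container's ephemeral-storage request, which the ScaledJob above never declares. A minimal sketch of what declaring one could look like in the agent container's resources block (the ephemeral-storage sizes are illustrative assumptions, not values from the original spec):

resources:
  requests:
    cpu: 500m
    memory: 3Gi
    ephemeral-storage: 2Gi     # assumed size; lets the scheduler account for the agent's disk usage
  limits:
    cpu: 1500m
    memory: 6Gi
    ephemeral-storage: 20Gi    # assumed size; the kubelet evicts just this pod if it exceeds the limit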
Answer 1
Score: 1
I have set safe-to-evict to false, so AKS won't relocate the pod/job because of a node downscale.

The drawback here is that AKS can end up keeping more nodes than needed, so you must make sure the pod/job won't be there forever.
spec:
  jobTargetRef:
    template:
      metadata:
        annotations:
          "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Another possibility is to change the node downscale timeout.

Terraform code:
auto_scaler_profile {
  scale_down_unneeded = "90m"
}
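For reference, a minimal sketch of where this block sits in the azurerm_kubernetes_cluster resource, assuming the azurerm 3.x provider (the resource names, node counts and autoscaler bounds below are illustrative, not taken from the original setup):

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-ado-agents"                    # assumed cluster name
  location            = azurerm_resource_group.rg.location  # assumed resource group reference
  resource_group_name = azurerm_resource_group.rg.name
  dns_prefix          = "aks-ado-agents"

  default_node_pool {
    name                = "mypool"
    vm_size             = "Standard_E8ds_v5"
    enable_auto_scaling = true  # the profile below only matters when the cluster autoscaler is on
    min_count           = 1
    max_count           = 3     # assumed upper bound
  }

  identity {
    type = "SystemAssigned"
  }

  auto_scaler_profile {
    # keep unneeded nodes for 90 minutes before the autoscaler removes them
    scale_down_unneeded = "90m"
  }
}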