Argocd notifications for long running sync operations
Question
I'm running Argocd 2.7.4 and I want to get an alert for sync operations that take longer than a few minutes. I tried using the example from the documentation (https://argo-cd.readthedocs.io/en/stable/operator-manual/notifications/triggers/#functions)
when: time.Now().Sub(time.Parse(app.status.operationState.startedAt)).Minutes() >= 5
However, that only triggers after the operation has finished, which is kinda useless for my use case. Is it actually possible to create an alert like this with Argo CD directly, or do I need to build it myself?
Answer 1
Score: 1
I suspect this is not natively supported, which means you would need to use a combination of Argo CD's API and a monitoring tool like Prometheus. The idea would be to expose Argo CD's metrics to Prometheus and then create an alert based on the duration of the sync operation.
Here is a high-level outline of the steps you can follow:
First, expose Argo CD's Application Controller metrics to Prometheus through the argocd-metrics service.
Then set up Prometheus to scrape the metrics from Argo CD, using the Prometheus configuration.
See for example "Argo CD metrics with Prometheus operator and Datadog" by Ritesh Nanda (Oct. 2022).
Next, create a custom alert in Prometheus based on the duration of the sync operation.
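That would involve querying the applications' status (ApplicationService): a GET request to the /api/v1/applications/{applicationName} endpoint returns the application's sync status in the status.sync.status field (Synced, OutOfSync or Unknown) and the state of the latest operation under status.operationState, where phase is Running while a sync is still in progress and startedAt records when it began.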
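To make the field lookup concrete, here is a minimal sketch of how the elapsed time of a still-running operation could be derived from one application object. The helper name running_sync_seconds and the sample payload are hypothetical, and the timestamp format is assumed to be whole-second UTC with a trailing Z.

```python
from datetime import datetime, timezone
from typing import Optional


def running_sync_seconds(app: dict, now: datetime) -> Optional[float]:
    """Return seconds elapsed in a still-running operation, or None if none is running."""
    state = (app.get("status") or {}).get("operationState") or {}
    if state.get("phase") != "Running" or not state.get("startedAt"):
        return None
    # Assumption: timestamps look like "2023-06-01T12:00:00Z" (UTC, whole seconds)
    started = datetime.strptime(state["startedAt"], "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    return (now - started).total_seconds()


# Hypothetical, trimmed-down application object for illustration only
sample = {
    "metadata": {"name": "demo"},
    "status": {
        "sync": {"status": "OutOfSync"},
        "operationState": {"phase": "Running", "startedAt": "2023-06-01T12:00:00Z"},
    },
}
now = datetime(2023, 6, 1, 12, 7, 30, tzinfo=timezone.utc)
print(running_sync_seconds(sample, now))  # 450.0
```

Returning None rather than 0 for idle applications keeps "nothing is running" distinguishable from "a sync just started".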
To expose the sync operation durations to Prometheus, you would need to create a custom exporter. This exporter would run a script such as the following to fetch the sync operation durations and expose them as a Prometheus metric.
import os
import time

import requests
from dateutil.parser import parse
from prometheus_client import start_http_server, Gauge

# URL of the Argo CD API server
argo_cd_api_server = "http://<argocd-server>:<port>"
# Assumption: an API token is supplied via the ARGOCD_TOKEN environment variable
headers = {"Authorization": f"Bearer {os.environ['ARGOCD_TOKEN']}"}

# Gauge tracking how long the current sync operation has been running
SYNC_DURATION = Gauge('argocd_sync_duration_seconds', 'Duration of sync operations', ['application'])

def fetch_sync_durations():
    # Get a list of all applications (each item already includes its status)
    response = requests.get(f"{argo_cd_api_server}/api/v1/applications", headers=headers, timeout=30)
    response.raise_for_status()
    for application in response.json().get("items") or []:
        name = application["metadata"]["name"]
        operation = application.get("status", {}).get("operationState") or {}
        # An operation in progress has phase "Running"; status.sync.status is never "Syncing"
        if operation.get("phase") == "Running" and operation.get("startedAt"):
            started_at = parse(operation["startedAt"])
            SYNC_DURATION.labels(application=name).set(time.time() - started_at.timestamp())
        else:
            # Reset the gauge so a finished sync does not keep an alert firing
            SYNC_DURATION.labels(application=name).set(0)

if __name__ == '__main__':
    # Start up the server to expose the metrics
    start_http_server(8000)
    # Fetch sync operation durations every 60 seconds
    while True:
        fetch_sync_durations()
        time.sleep(60)
That would define a Gauge metric named argocd_sync_duration_seconds, created to track the duration of sync operations.
The fetch_sync_durations function fetches the sync operation durations and sets the value of the Gauge metric for each application.
The script then starts an HTTP server on port 8000 to expose the metrics and runs the fetch_sync_durations function every 60 seconds.
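Once Prometheus scrapes that gauge, the alert itself is a threshold on it. Here is a rough sketch of what such an alerting rule could contain, built as a Python dictionary and printed as JSON. The group name, alert name, severity label and the one-minute "for" window are all hypothetical choices. Prometheus rule files are normally written in YAML, so treat the JSON output as something to convert rather than a drop-in file.

```python
import json


def long_sync_alert_rule(threshold_seconds: int = 300) -> dict:
    """Build a Prometheus-style alerting rule for syncs running longer than the threshold."""
    return {
        "groups": [{
            "name": "argocd-sync",  # hypothetical group name
            "rules": [{
                "alert": "ArgoCDSyncRunningTooLong",  # hypothetical alert name
                "expr": f"argocd_sync_duration_seconds > {threshold_seconds}",
                "for": "1m",
                "labels": {"severity": "warning"},
                "annotations": {
                    # The first literal is a plain string, so the template braces survive
                    "summary": "Sync of {{ $labels.application }} has been running for more than "
                               f"{threshold_seconds} seconds",
                },
            }],
        }]
    }


print(json.dumps(long_sync_alert_rule(), indent=2))
```

With the default of 300 seconds this matches the five-minute threshold from the question. Because the exporter only refreshes every 60 seconds, the alert can lag the real duration by up to a minute plus the scrape interval.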
Finally, configure Alertmanager to send notifications based on the alert. You can follow the Prometheus Alertmanager documentation to set it up and configure the notification channels (e.g., Slack, email, etc.).
A bit convoluted, but that would enable a custom alert for long-running sync operations in Argo CD.