Tasks fail without emitting logs in Airflow
Question
DAGs fail without displaying any logs, yet they can still be started successfully after 3-4 attempts.
We use a pdiOperator (Pentaho Data Integration), meaning that for each DAG a separate pod is created in Kubernetes in which it is executed. In effect, no logs are displayed because execution never reaches the point where PDI is launched for the DAG, but I don't understand why.
Most importantly, nothing marked as an error appears in the scheduler or worker logs.
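The in-house pdiOperator itself is not shown in the question, so purely as a rough analogy for the pod-per-task pattern described above, the sketch below uses the stock KubernetesPodOperator from the cncf.kubernetes provider. The image, namespace, command and transformation path are all placeholders, not values from the actual setup.

```python
# Rough analogy only: the question's pdiOperator is in-house and not shown here.
# This uses the stock KubernetesPodOperator to illustrate "one pod per task";
# image, namespace and the PDI command/file below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,  # provider module path used around Airflow 2.3
)

with DAG(
    dag_id="pdi_pod_per_task_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_transformation = KubernetesPodOperator(
        task_id="run_pdi_transformation",
        name="pdi-transformation",
        namespace="airflow",                    # placeholder namespace
        image="example/pentaho-pdi:latest",     # placeholder PDI image
        cmds=["/opt/pentaho/pan.sh"],           # pan.sh runs a PDI transformation (placeholder path)
        arguments=["-file=/jobs/example.ktr"],  # placeholder .ktr path
        get_logs=True,                          # stream the pod's stdout into the task log
        is_delete_operator_pod=True,            # remove the pod once the task finishes
    )
```

If the real operator behaves like this, a task that dies before its pod is ever created would indeed leave no task log behind, which matches the symptom described above.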
We are using:
- Airflow v2.3.1 in Kubernetes (2 webservers, 2 workers, 2 schedulers)
- Celery executor (concurrency 9 per worker, i.e. 18 in total across the two workers)
- Pentaho Data Integration (pdiOperator)
- RabbitMQ 3.8.19, also in Kubernetes
- PostgreSQL 13 for the metadata database, on a separate on-premises machine
- a git-sync container in each pod to synchronize the DAG code
What we have tried:
- Increased dagbag_import_timeout to 300 and dag_file_processor_timeout to 350; everything worked like clockwork for about a month, but then the problem reappeared.
- Increased them again, to 600 and 650 respectively; it worked for a week and then started failing again.
- After that, set a sleep 60 (called initialstartupdelay in the Helm chart's values) before starting the scheduler, worker and web pods, because we suspected they did not have time to load the DAG code copied by the git-sync pods, but this did not help either. A quick way to check how long DAG parsing actually takes is sketched after this list.
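Since both timeout tweaks above revolve around DAG parsing, a quick diagnostic is to time the parse from inside one of the pods. A minimal sketch, assuming it is run in a scheduler or worker pod that mounts the same git-synced dags folder (none of this comes from the original question):

```python
# Diagnostic sketch (assumes it runs inside a scheduler/worker pod that sees
# the same git-synced dags folder). Times how long DagBag takes to parse the
# DAG files and prints any import errors -- the work that dagbag_import_timeout
# and dag_file_processor_timeout are guarding against.
import time

from airflow.models import DagBag

start = time.monotonic()
bag = DagBag(include_examples=False)  # parse the dags folder this pod actually sees
elapsed = time.monotonic() - start

print(f"parsed {len(bag.dags)} DAGs in {elapsed:.1f}s")
for path, err in bag.import_errors.items():  # files that failed to import
    print(f"IMPORT ERROR in {path}: {err}")
```

If the parse time gets anywhere near the configured timeouts, the parsing itself is the bottleneck rather than git-sync lag.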
Answer 1
Score: 0
This is probably caused by tasks becoming zombies. A zombie task is one that the Airflow database thinks is running but hasn't emitted a heartbeat for a certain amount of time. There are a lot of things that can cause zombies. Some examples include:
- The worker runs out of memory.
- The network connection between the worker and the Airflow database is severed.
- Bugs that have since been fixed in newer versions of Airflow.
I recommend checking to see if the worker ran out of memory. If it didn't, I'd probably recommend upgrading to a newer version of Airflow or deep-diving the scheduler logs to see why the task became a zombie.
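One way to act on that advice, as a hedged sketch: the [scheduler] scheduler_zombie_task_threshold option does exist in Airflow 2.3 (seconds without a heartbeat before a running task is treated as a zombie), but the log file path and the exact wording the search looks for are assumptions that will vary per deployment.

```python
# Diagnostic sketch: print the zombie cutoff the scheduler uses, then scan a
# scheduler log dump (e.g. saved via `kubectl logs <scheduler-pod> > scheduler.log`)
# for zombie-related lines. The log path is a placeholder.
import re
import sys

from airflow.configuration import conf

# Seconds without a heartbeat before a running task is reaped as a zombie
# ([scheduler] scheduler_zombie_task_threshold, default 300 in Airflow 2.3).
threshold = conf.getint("scheduler", "scheduler_zombie_task_threshold")
print(f"zombie threshold: {threshold}s")

log_path = sys.argv[1] if len(sys.argv) > 1 else "scheduler.log"  # placeholder path
zombie_re = re.compile(r"zombie", re.IGNORECASE)
with open(log_path) as log_file:
    for line in log_file:
        if zombie_re.search(line):
            print(line.rstrip())
```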