英文:
How to prevent slurm from terminating entire job when last process fails or is killed?
问题
如果最后一个进程(第四个)崩溃或被杀死,整个作业将终止。而对于前三个进程则不同,如果我杀死前三个进程中的任何一个,作业会一直等到所有剩余的进程完成执行才会终止。但如果我杀死最后一个进程,它会终止并杀死前三个进程。
我的job_scrpt.sh文件包含以下内容:
python process1 & process2 & process3 & process4
在Slurm中,作业终止的方式(无论成功还是失败)如何工作?
英文:
I am launching 4 process via job script. These 4 processes launch successfully on the slurm host. But if the last process(4th one) crashed or get killed, the whole job get terminated. That's not the case with first 3 process. If I kill any one of first 3 process, the job stays there until all remaining processes complete their execution. But if I killed the last one, it terminates and kill first three processes as well.
my job_scrpt.sh contains:
python process1 & process2 & process3 & process4
how does the job termination(in both successful and failed cases) works in slurm?
答案1
得分: 0
使用以下方式:
process1 &
process2 &
process3 &
process4
脚本将会在`process4`终止时立即终止(它是唯一一个“阻塞”的);而作业也会这样。
你应该这样写:
process1 &
process2 &
process3 &
process4 &
wait
这样可以将所有四个进程发送到后台,并且`wait`命令会阻塞脚本,直到它们全部终止,作业也会如此。
<details>
<summary>英文:</summary>
With
process1 &
process2 &
process3 &
process4
the script will terminate as soon as `process4` terminates (it is the only one "blocking"); and so will the job.
You should write
process1 &
process2 &
process3 &
process4 &
wait
so as to send all four processes to the background, and the `wait` command will block the script until all of them terminate, and so will the job.
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论