How to prevent Slurm from terminating the entire job when the last process fails or is killed?


Question


I am launching 4 processes via a job script. These 4 processes launch successfully on the Slurm host. But if the last process (the 4th one) crashes or gets killed, the whole job gets terminated. That is not the case with the first 3 processes: if I kill any one of the first 3, the job stays there until all the remaining processes complete their execution. But if I kill the last one, the job terminates and kills the first three processes as well.

My job_scrpt.sh contains:

    python process1 & process2 & process3 & process4

How does job termination (in both successful and failed cases) work in Slurm?
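
For reference, a minimal self-contained script that reproduces the behaviour described above (a sketch only: the `#SBATCH` values and the `sleep` commands are placeholders standing in for the real processes, not part of the original post):

    #!/bin/bash
    #SBATCH --job-name=repro        # placeholder job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4       # placeholder resource request
    #SBATCH --time=00:10:00         # placeholder time limit

    # Three processes in the background, one in the foreground,
    # mirroring "cmd1 & cmd2 & cmd3 & cmd4".
    sleep 600 &
    sleep 600 &
    sleep 600 &
    sleep 600

Submitted with `sbatch`, killing any of the three background `sleep` commands leaves the job running, while killing the foreground one ends the job, exactly as described above.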

Answer 1

Score: 0

With 

    process1 &
    process2 &
    process3 &
    process4

the script will terminate as soon as `process4` terminates (it is the only one "blocking"); and so will the job.

You should write

    process1 &
    process2 &
    process3 &
    process4 &
    wait

so as to send all four processes to the background, and the `wait` command will block the script until all of them terminate, and so will the job.
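
As a concrete illustration, here is what the full batch script could look like with this fix (a sketch only: the `#SBATCH` values are placeholders and `process1` through `process4` stand for whatever commands the job actually runs):

    #!/bin/bash
    #SBATCH --job-name=four-procs   # placeholder job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4       # one CPU per background process (adjust as needed)
    #SBATCH --time=01:00:00         # placeholder time limit

    # Put all four processes in the background so that no single one
    # of them keeps the script alive on its own.
    process1 &
    process2 &
    process3 &
    process4 &

    # Block until every background process has exited; the job ends
    # only after the last one finishes, no matter which one dies first.
    wait

If you also want the job to be reported as failed when any of the four processes fails (not part of the answer above, just standard shell behaviour), you can wait on each PID individually and have the script exit non-zero:

    process1 & pid1=$!
    process2 & pid2=$!
    process3 & pid3=$!
    process4 & pid4=$!

    status=0
    for pid in "$pid1" "$pid2" "$pid3" "$pid4"; do
        wait "$pid" || status=1   # remember if any process exited non-zero
    done
    exit "$status"                # the script's exit code becomes the job's exit code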



