Can SLURM jobs keep running after a computer reboot?

Question

I was running some jobs under SLURM on my PC when the computer rebooted.

Once the computer was back on, squeue showed that the jobs that had been running before the reboot were no longer running, because the node was in a drain state. They seemed to have been automatically requeued after the reboot.

I couldn't submit more jobs because the node was drained, so I used scancel to cancel the jobs that had been automatically requeued.
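For reference, a minimal sketch of that scancel step (the job IDs below are placeholders, not the actual ones):

# cancel the requeued jobs by ID
scancel 1234 1235
# or cancel every job belonging to the current user
scancel -u $USER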

The problem is that I cannot free the node. I tried a few things:

  1. Restarting slurmctld and slurmd

  2. "undraining" the nodes as explained in another question, but no success. The commands ran without any output (I assume this is good), but the state of the node did not change.

  3. I then tried manually rebooting the system to see if anything would change.
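For reference, the usual "undrain" sequence boils down to two scontrol commands; this is a sketch assuming the node name neuropc from the output below:

# a reason string is required when forcing the node down
scontrol update nodename=neuropc state=down reason="undrain after reboot"
# then return it to service
scontrol update nodename=neuropc state=resume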

Running scontrol show node neuropc gives

[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm@2023-02-05T22:06:33]

Weirdly, the System Monitor shows that all 8 cores keep having activity between 5% and 15%, whereas the Processes tab shows only one app (TeamViewer) using less than 4% of the processor.

So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.

I use Ubuntu 20.04 and slurm 19.05.5.
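For anyone diagnosing a similar state, two generic checks, sketched here as an aside (nothing below is specific to this machine):

# show the reason each drained node was drained
sinfo -R
# check whether any Slurm job steps account for the CPU activity
ps -ef | grep slurmstepd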

Answer 1

Score: 1

To strictly answer the question: no, they cannot. They may or may not be requeued, depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job is able to do checkpoint/restart. But there is no way a running process can survive a server reboot.
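As a sketch of the knobs this answer alludes to (the parameter and flag names are real Slurm ones; the values and the script name job.sh are illustrative):

# slurm.conf: whether batch jobs are requeued after a node failure or reboot
JobRequeue=1
# per-job override at submission time (job.sh is a placeholder)
sbatch --no-requeue job.sh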

Answer 2

Score: 0

This answer solved my problem. Copying it here:

> This could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure and then a scontrol update nodename=transgen-4 state=resume.
> If that works, you can try raising the value a bit.
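A way to sanity-check the value before editing slurm.conf: slurmd -C prints the hardware configuration the node actually reports, including RealMemory. A minimal sketch, reusing the node name from the quoted answer:

# print the node's hardware configuration as Slurm detects it
slurmd -C
# after lowering RealMemory in slurm.conf, apply the change and resume the node
scontrol reconfigure
scontrol update nodename=transgen-4 state=resume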
