Can SLURM jobs keep running after a computer reboot?

Question

I was running some jobs under SLURM on my PC when the computer rebooted.

Once the computer was back on, squeue showed that the jobs that had been running before the reboot were no longer running, because the node was in a drain state. They seemed to have been automatically requeued after the reboot.

I couldn't submit more jobs because the node was drained, so I used scancel to cancel the jobs that had been automatically requeued.
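For reference, a minimal sketch of that scancel step (the job IDs below are placeholders, not the actual ones):

# cancel the requeued jobs by ID
scancel 1234 1235
# or cancel every job belonging to the current user
scancel -u $USER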

The problem is that I cannot free the node. I tried a few things:

  1. Restarting slurmctld and slurmd

  2. "undraining" the nodes as explained in another question, but no success. The commands ran without any output (I assume this is good), but the state of the node did not change.

  3. I then tried manually rebooting the system to see if anything would change.
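For reference, the usual "undrain" sequence boils down to two scontrol commands; this is a sketch assuming the node name neuropc from the output below:

# a reason string is required when forcing the node down
scontrol update nodename=neuropc state=down reason="undrain after reboot"
# then return it to service
scontrol update nodename=neuropc state=resume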

Running scontrol show node neuropc gives

[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm@2023-02-05T22:06:33]

Weirdly, the System Monitor shows that all 8 cores keep having activity between 5% and 15%, whereas the Processes tab shows only one app (TeamViewer) using less than 4% of the processor.

So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.

I use Ubuntu 20.04 and slurm 19.05.5.
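For anyone diagnosing a similar state, two generic checks, sketched here as an aside (nothing below is specific to this machine):

# show the reason each drained node was drained
sinfo -R
# check whether any Slurm job steps account for the CPU activity
ps -ef | grep slurmstepd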

Answer 1

Score: 1

To strictly answer the question: no, they cannot. They may or may not be requeued, depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job is able to do checkpoint/restart. But there is no way a running process can survive a server reboot.
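As a sketch of the knobs this answer alludes to (the parameter and flag names are real Slurm ones; the values and the script name job.sh are illustrative):

# slurm.conf: whether batch jobs are requeued after a node failure or reboot
JobRequeue=1
# per-job override at submission time (job.sh is a placeholder)
sbatch --no-requeue job.sh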

Answer 2

Score: 0

This answer solved my problem. Copying it here:

> This could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure and then a scontrol update nodename=transgen-4 state=resume.
> If that works, you can try raising the value a bit.
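A way to sanity-check the value before editing slurm.conf: slurmd -C prints the hardware configuration the node actually reports, including RealMemory. A minimal sketch, reusing the node name from the quoted answer:

# print the node's hardware configuration as Slurm detects it
slurmd -C
# after lowering RealMemory in slurm.conf, apply the change and resume the node
scontrol reconfigure
scontrol update nodename=transgen-4 state=resume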
