Autostart `slurmd` service on compute nodes after reboot

Question

I am calling scontrol reboot <nodename> to reboot compute nodes in my SLURM cluster.

The reboot usually times out (as seen from SLURM) and the node is set to state "DOWN".
(ResumeTimeout is set to 300 seconds.)

This presumably happens because the slurmd service does not autostart itself after boot.
By default, the service is "disabled":

[root@c1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Enabling it with systemctl enable slurmd does not survive the next reboot; afterwards the service is "disabled" again.
I assume this is because the change is not made in the image that the nodes boot from.

How can I enable the slurmd service on the compute nodes so that it starts on boot and scontrol reboot works?
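
One way to test the assumption about the boot image is to look for the symlink that systemctl enable creates. The sketch below is not part of the original question. `unit_enabled_in_root` is a hypothetical helper, and it assumes the slurmd unit is installed with WantedBy=multi-user.target, so that enabling it creates a link under etc/systemd/system/multi-user.target.wants/.

```shell
#!/bin/sh
# unit_enabled_in_root ROOT UNIT
# Succeeds if the filesystem tree at ROOT contains the symlink that
# `systemctl enable UNIT` would create.
# Assumption: the unit's [Install] section uses WantedBy=multi-user.target.
unit_enabled_in_root() {
    root="$1"
    unit="$2"
    # -L tests for the link itself, so it also works when the link target
    # only resolves inside the image and not on the current host.
    [ -L "$root/etc/systemd/system/multi-user.target.wants/$unit" ]
}
```

Calling `unit_enabled_in_root / slurmd.service` on a running node checks the live system, while passing the image directory on the master node checks what the node will see after its next boot. If the first check succeeds and the second fails, the change was probably only made in the node's volatile root filesystem.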

Answer 1

Score: 2

This is probably not the recommended way, but I set up a mini cluster at work and fixed it there with a cron job:

@reboot /usr/bin/scontrol update nodename=[put hostname here] state=resume
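
To avoid typing the hostname by hand on every node, the crontab entry can be generated. This is only a sketch: `resume_cron_line` is a hypothetical helper, and the scontrol path is taken from the entry above.

```shell
#!/bin/sh
# resume_cron_line NODE
# Print the @reboot crontab entry that resumes NODE in Slurm.
# Assumption: scontrol is installed at /usr/bin/scontrol, as in the entry above.
resume_cron_line() {
    node="$1"
    printf '@reboot /usr/bin/scontrol update nodename=%s state=resume\n' "$node"
}
```

For example, `resume_cron_line "$(hostname -s)"` prints the entry for the local machine, assuming its short hostname matches the NodeName configured in slurm.conf. Note that this entry only asks the controller to resume the node; it does not start slurmd itself.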

Answer 2

Score: 2

I got a reply from Antanas Budriūnas via the OpenHPC mailing list which solved the issue.

(execute on the master node)
# chroot /<path>/<to>/<cnode>/<image>
# systemctl enable slurmd
# exit
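
The same steps can be wrapped in a small script so they can be previewed before touching the image. This is a sketch and not part of the mailing-list reply: `run` and `enable_slurmd_in_image` are hypothetical helpers, DRY_RUN is an illustrative variable, and the image path is left to the caller.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Execute the command, or only print it when DRY_RUN=1 (illustrative variable).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# enable_slurmd_in_image IMAGE_ROOT
# Enable slurmd inside the compute-node image rooted at IMAGE_ROOT.
# Must be run as root on the master node, because chroot requires it.
enable_slurmd_in_image() {
    image_root="$1"
    if [ ! -d "$image_root" ]; then
        echo "image root not found: $image_root" >&2
        return 1
    fi
    # Passing the command to chroot directly replaces the interactive
    # chroot / systemctl enable / exit sequence shown above.
    run chroot "$image_root" systemctl enable slurmd
}
```

Running `DRY_RUN=1 enable_slurmd_in_image /<path>/<to>/<cnode>/<image>` prints the command without executing it. Depending on the provisioning system, the image may have to be rebuilt from the chroot afterwards (OpenHPC's Warewulf recipe uses `wwvnfs --chroot` for this) before rebooted nodes pick up the change.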

huangapple
  • Posted on January 3, 2020, 22:35:27