Autostart `slurmd` service on computes after reboot
Question
I am calling `scontrol reboot <nodename>` to reboot compute nodes in my SLURM cluster.
The reboot usually times out (as seen from SLURM) and the node is set to state "DOWN" (`ResumeTimeout` is set to 300 seconds).
This presumably happens because the `slurmd` service does not start automatically after boot.
By default, the service is "disabled":
[root@c1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Enabling it with `systemctl enable slurmd` does not persist across the next reboot; afterwards the service is "disabled" again.
I assume this is because the change is not made in the image that the nodes boot from.
How can I enable the `slurmd` service on the compute nodes so that it starts on boot and `scontrol reboot` works?
Answer 1
Score: 2
This is probably not the recommended way, but I set up a mini cluster at work and fixed it there with a cron job:
@reboot /usr/bin/scontrol update nodename=[put hostname here] state=resume
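A minimal sketch of how that crontab line could be generated per node instead of typing the hostname by hand. The helper name `resume_cron_line` is hypothetical, and the sketch assumes that the node name known to SLURM matches the short hostname and that the crontab itself survives a reboot on the node:

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): prints the @reboot crontab
# line from the answer above for a given node name.
# Assumption: without an argument, the short hostname is the SLURM node name.
resume_cron_line() {
    node="${1:-$(hostname -s)}"
    printf '@reboot /usr/bin/scontrol update nodename=%s state=resume\n' "$node"
}

# Example: print the line for node c1 so it can be pasted into `crontab -e`.
resume_cron_line c1
```

Note that this only asks the controller to resume the node; it does not start `slurmd` by itself.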
Answer 2
Score: 2
I got a reply from Antanas Budriūnas via the OpenHPC mailing list which solved the issue.
(Execute on the master node.)
# chroot /<path>/<to>/<cnode>/<image>
# systemctl enable slurmd
# exit
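The same steps can be sketched non-interactively. The function name `enable_in_image` and the `DRY_RUN` switch are hypothetical and only there for illustration; the image path stays a placeholder that has to be replaced with the real location of the compute node image:

```shell
#!/bin/sh
# Hypothetical wrapper: enable a systemd unit inside a compute node image
# by running systemctl in a chroot, as in the answer above.
# With DRY_RUN set to a non-empty value, the command is printed, not run.
enable_in_image() {
    image_root="$1"
    unit="$2"
    if [ ! -d "$image_root" ]; then
        echo "no such image directory: $image_root" >&2
        return 1
    fi
    ${DRY_RUN:+echo} chroot "$image_root" systemctl enable "$unit"
}

# Dry-run demo against a throwaway directory; a real run needs root and
# the real image path, i.e. /<path>/<to>/<cnode>/<image> from the answer.
demo_root="$(mktemp -d)"
DRY_RUN=1 enable_in_image "$demo_root" slurmd
rmdir "$demo_root"
```

Depending on the provisioning system, the image may have to be rebuilt before the nodes pick up the change. On a Warewulf-based OpenHPC setup (an assumption, the answer does not say which provisioner is used) that would likely be something like `wwvnfs --chroot` with the same image path, followed by rebooting the nodes.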