Question about the salloc command: Where does it execute?
Question
I have a question about the salloc command in a cluster environment. When I run salloc -n 1 --gpus=1 hostname on the login node, it still prints the hostname of the login node; I expected it to print the compute node's hostname instead. Similarly, when I run salloc -n 1 --gpus=1, it starts /bin/bash on the login node with the resources allocated.
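Roughly, what I observe is the following (the hostname login01 and the job ID are just placeholders):
$ salloc -n 1 --gpus=1 hostname
salloc: Granted job allocation 12345
login01
salloc: Relinquishing job allocation 12345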
My question is: if the command is not a shell like /bin/bash, does salloc have any effect at all? Does it only allocate the resources and run the command on the login node, without using the compute nodes? It seems that salloc only makes use of the compute nodes when the command it runs is a shell.
I would appreciate any clarification on this matter. Thank you.
Answer 1
Score: 1
With the default configuration, salloc will only create an allocation, that is, request the resources and block until they are available, and then start a shell on the login node, not on the allocated node. In that shell, you can start a parallel program with srun or mpirun and the processes will run on the allocated nodes. Or you can run:
srun --pty /bin/bash -l
and you will have a shell running on the allocated node.
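For example, a session might look roughly like this (the hostnames login01/node042 and the job ID are made up for illustration):
[user@login01 ~]$ salloc -n 1 --gpus=1
salloc: Granted job allocation 12345
[user@login01 ~]$ hostname            # the shell itself is still on the login node
login01
[user@login01 ~]$ srun hostname       # srun launches inside the allocation
node042
[user@login01 ~]$ srun --pty /bin/bash -l
[user@node042 ~]$                     # now you are in a shell on the compute node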
Alternatively, and this has been the officially recommended way for some time, you can use the srun command directly (i.e. not inside a salloc session) like this:
srun -n 1 --gpus=1 --pty /bin/bash -l
for the same result.
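With the same made-up hostnames as above, that drops you directly onto the allocated node:
[user@login01 ~]$ srun -n 1 --gpus=1 --pty /bin/bash -l
[user@node042 ~]$ hostname
node042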
This has confused users for a long time, especially since Slurm used to recommend defining SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL" in slurm.conf, which had the effect of starting an srun session automatically when the user ran the salloc command.
In newer versions, Slurm has an option LaunchParameters=use_interactive_step that is meant to become the default; it makes salloc the command to use to get a shell on the first node of the allocation, while properly handling cgroups and tasksets.
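For reference, the two slurm.conf settings mentioned above would look roughly like this (admin-side configuration; shown only as a sketch):
# Older recommendation, no longer advised: auto-start an srun shell when salloc is invoked
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL"
# Newer approach: let salloc itself open an interactive step on the first allocated node
LaunchParameters=use_interactive_step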


Comments