Question about the salloc command: Where does it execute?
Question
I have a question about the salloc command in a cluster environment. When I execute salloc -n 1 --gpus=1 hostname on the login node, it still displays the hostname of the login node instead of the compute node's hostname; I expected to get the compute node's hostname. Similarly, when I execute salloc -n 1 --gpus=1, it runs /bin/bash on the login node with the resources allocated.
My question is: if the command is not a shell like /bin/bash, does the salloc command have any effect? Will it only allocate resources and execute the command on the login node, without utilizing the compute nodes? It seems like salloc only utilizes the compute nodes when executing shell commands.
I would appreciate any clarification on this matter. Thank you.
Answer 1
Score: 1
With the default configuration, salloc will only create an allocation, that is, request resources and block until they are available, and then start a shell on the login node, not on the allocated node. Then, in that shell, you can start a parallel program with srun or mpirun and the processes will run on the allocated nodes. Or you can run
srun --pty /bin/bash -l
and you will have a shell running on the allocated node.
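For example, a minimal session showing the difference might look like this (login01 and node042 are placeholder hostnames, not from the original post):
salloc -n 1 --gpus=1      # blocks until the allocation is granted; the shell stays on login01
hostname                  # still prints login01 -- the shell runs on the login node
srun hostname             # prints node042 -- the step runs inside the allocation
exit                      # release the allocation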
Alternatively, and this has been the officially recommended way for some time, you can use the srun command directly (i.e. not inside a salloc session) like this:
srun -n 1 --gpus=1 --pty /bin/bash -l
for the same result.
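Once that interactive shell starts, you can confirm where it is running and what was allocated (output depends on the cluster; nvidia-smi is only available if the NVIDIA tools are installed on the node):
hostname              # prints the compute node's name, not the login node's
echo $SLURM_JOB_ID    # the ID of the allocation created by srun
nvidia-smi            # typically lists only the GPU granted by --gpus=1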
This has confused users for a long time, especially since Slurm used to recommend defining SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL" in slurm.conf, which had the effect of starting an srun session automatically when the user ran the salloc command.
In newer versions, Slurm has an option LaunchParameters=use_interactive_step that is meant to become the default and makes salloc the command to use to get a shell on the first node of the allocation, while at the same time properly handling cgroups and tasksets.
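For cluster administrators, enabling that behaviour is a one-line setting (a minimal sketch, assuming a Slurm release recent enough to support the option):
# in slurm.conf
LaunchParameters=use_interactive_step
With that set, running salloc -n 1 --gpus=1 with no command drops the user into a shell on the first allocated node rather than on the login node.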
Comments