SLURM - forcing MPI to schedule different ranks on different physical CPUs
Question
I am running an experiment on an 8 node cluster under SLURM. Each CPU has 8 physical cores, and is capable of hyperthreading. When running a program with
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
mpirun -n 64 bin/hello_world_mpi
it schedules two ranks on the same physical core. Adding the option
#SBATCH --ntasks-per-core=1
makes SLURM reject the job with "Batch job submission failed: Requested node configuration is not available". Is it somehow only allocating 4 physical cores per node? How can I fix this?
Answer 1
Score: 1
You can check the available CPU information in your cluster using sinfo -o %C, which prints CPU counts in the form allocated/idle/other/total.
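
As a small illustration of reading that output, the sketch below splits an allocated/idle/other/total string into labelled fields. The helper name cpu_summary and the sample value are hypothetical; on a real cluster the string would come from sinfo itself.

```shell
#!/bin/sh
# Split an "allocated/idle/other/total" string, as printed by `sinfo -h -o %C`,
# into labelled fields. cpu_summary is a hypothetical helper name.
cpu_summary() {
    # Read the four slash-separated counts from the first argument.
    IFS=/ read -r alloc idle other total <<EOF
$1
EOF
    echo "allocated=${alloc} idle=${idle} other=${other} total=${total}"
}

# Hypothetical sample value, not taken from the question's cluster.
# On a real cluster: cpu_summary "$(sinfo -h -o %C)"
cpu_summary "0/128/0/128"
```

If the total is twice the number of physical cores you expect, Slurm is most likely counting each hardware thread as a CPU.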
I wasn't able to find a --ntasks-per-cpu option for sbatch in the documentation. You could try --ntasks-per-core instead. As per the documentation:
> --ntasks-per-core=<ntasks>
> Request the maximum ntasks be invoked on each core. Meant to be used with the --ntasks option. Related to --ntasks-per-node except at
> the core level instead of the node level. This option will be
> inherited by srun.
You could also try --cpus-per-task:
> -c, --cpus-per-task=<ncpus>
> Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task. Without this option, the
> controller will just try to allocate one processor per task.
Also please note:
> Beginning with 22.05, srun will not inherit the --cpus-per-task
> value requested by salloc or sbatch. It must be requested again with
> the call to srun or set with the SRUN_CPUS_PER_TASK environment
> variable if desired for the task(s).
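
Putting these options together, a batch script might look like the sketch below. It assumes that each physical core exposes 2 hardware threads which Slurm counts as separate CPUs, so that --cpus-per-task=2 is intended to give every rank a whole core, and that the MPI library can be launched through srun. Neither assumption is confirmed by the question. The srun guard and the echo fallback are only there so the script can be dry-run outside a Slurm allocation.

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2   # assumption: 2 hardware threads per physical core

# Since Slurm 22.05, srun does not inherit --cpus-per-task from sbatch,
# so pass the value on explicitly through the environment.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# the fallback of 2 mirrors the assumption above.
export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-2}"

# Launch only when srun is available, so the script can be dry-run elsewhere.
if command -v srun >/dev/null 2>&1; then
    srun bin/hello_world_mpi
else
    echo "srun not found; would run with SRUN_CPUS_PER_TASK=${SRUN_CPUS_PER_TASK}"
fi
```

If the node configuration cannot satisfy the request, sbatch will refuse the job just as in the question, so compare the numbers against what sinfo reports first.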