April 13, 2023, 20:31:51

SLURM - forcing MPI to schedule different ranks on different physical CPUs

Question

I am running an experiment on an 8 node cluster under SLURM. Each CPU has 8 physical cores, and is capable of hyperthreading. When running a program with

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8

mpirun -n 64 bin/hello_world_mpi

it schedules two ranks on the same physical core. Adding the option

#SBATCH --ntasks-per-core=1

fails with the SLURM error "Batch job submission failed: Requested node configuration is not available". Is it somehow only allocating 4 physical cores per node? How can I fix this?

Answer 1

Score: 1

You can check the available CPU information in your cluster using sinfo -o%C.
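
As a rough illustration of reading that output: the %C field of sinfo reports CPU counts in the form allocated/idle/other/total. The sketch below splits such a value with cut. The helper names idle_cpus and total_cpus and the sample string 16/48/0/64 are made up for this example and are not taken from the cluster in the question.

```shell
#!/bin/sh
# Helpers for a sinfo "%C" value, which has the form allocated/idle/other/total.
# idle_cpus and total_cpus are hypothetical names used only in this sketch.

idle_cpus() {
    # Field 2 is the number of idle CPUs.
    printf '%s\n' "$1" | cut -d/ -f2
}

total_cpus() {
    # Field 4 is the total number of CPUs.
    printf '%s\n' "$1" | cut -d/ -f4
}

# Made-up sample value; on a cluster use: cpu_state=$(sinfo -h -o %C)
cpu_state="16/48/0/64"
echo "idle:  $(idle_cpus "$cpu_state")"
echo "total: $(total_cpus "$cpu_state")"
```

On a real cluster the input could come from sinfo -h -o %C, where -h suppresses the header line. Depending on how the nodes are configured, Slurm may count each hardware thread as a CPU, so a total that is twice the number of physical cores would suggest that hyperthreads are being counted.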

I wasn't able to find any --ntasks-per-cpu option for sbatch in the documentation. The closest documented option is --ntasks-per-core. As per the documentation:

> --ntasks-per-core=<ntasks>
> Request the maximum ntasks be invoked on each core. Meant to be used with the --ntasks option. Related to --ntasks-per-node except at
> the core level instead of the node level. This option will be
> inherited by srun.

You could also try --cpus-per-task:

> -c, --cpus-per-task=<ncpus>
> Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task. Without this option, the
> controller will just try to allocate one processor per task.

Also please note:

> Beginning with 22.05, srun will not inherit the --cpus-per-task
> value requested by salloc or sbatch. It must be requested again with
> the call to srun or set with the SRUN_CPUS_PER_TASK environment
> variable if desired for the task(s).
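
To make the two options above concrete, here is a minimal sketch of a batch script that requests two CPUs per task and passes that value on to srun. It assumes that Slurm on this cluster counts each hardware thread as one CPU (so two CPUs correspond to one physical core) and that the MPI library can be launched through srun; neither is confirmed by the question. The srun guard is only there so the script can be dry-run outside Slurm.

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
# Assumption: 2 hardware threads (Slurm CPUs) per physical core.
#SBATCH --cpus-per-task=2

# From Slurm 22.05 on, srun does not inherit --cpus-per-task from sbatch,
# so pass it along explicitly. SLURM_CPUS_PER_TASK is set by sbatch when
# --cpus-per-task is given; the fallback of 2 only matters outside a job.
export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-2}"

if command -v srun >/dev/null 2>&1; then
    srun bin/hello_world_mpi
else
    # Lets the script be dry-run on a machine without Slurm.
    echo "srun not found; SRUN_CPUS_PER_TASK=${SRUN_CPUS_PER_TASK}"
fi
```

Whether this places exactly one rank per physical core depends on the cluster's configuration, so it is still worth checking the resulting task placement on the actual nodes.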
