Running an Independent SLURM Job with mpirun Inside a Python Script - Recursive Call Error

I'm currently using a Python script that requires MPI to operate. This script runs on a SLURM system.

In order to run my Python script, I define the number of nodes to use and launch the following command within my sbatch submission file:

#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH --nodes=12
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 1 --report-bindings --bind-to none -oversubscribe python Main.py

This setup is working fine. However, I now want to introduce an additional task within the Python script that requires running another instance of mpirun. Since this task needs to run on different nodes, I decided to submit a new sbatch job that runs mpirun, using the following Python command:

os.system(f"sbatch execute.sub")

The associated execute.sub submission file is designed as follows:

#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 60 -oversubscribe --bind-to none other_application

However, when I attempt this, I encounter an error message: "mpirun does not support recursive calls". I'm confused by this, because I was under the impression that I was simply submitting a standalone job, independent of any other operations.

Could anyone please help me understand what's going wrong and how to correct it? Thanks in advance.

Answer 1

Score: 1

The problem is that the MPI and OpenMPI environment variables set by the initial mpirun are propagated to the nested one through os.system and sbatch (which by default exports the submitting environment to the job), so the inner mpirun thinks it is being called recursively.
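
If you want to see which of these variables are being inherited, a quick check inside Main.py could be something like this (a minimal sketch; the OMPI_/MPI_ prefixes are the ones the unset line below targets):

import os

# The MPI-related variables inherited from the outer mpirun; anything
# started via os.system (including sbatch) will see these too.
print([k for k in os.environ if k.startswith(("OMPI_", "MPI_"))])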

You can remove them in Bash with

unset "${!OMPI_@}" "${!MPI_@}"

That line can be placed before the mpirun command in execute.sub, or inside the os.system call like this: os.system('unset "${!OMPI_@}" "${!MPI_@}" ; sbatch execute.sub').

Alternatively, you can use the --export parameter of sbatch to export only the variables you need:

os.system(f"sbatch --export=PATH,LD_LIRARY_PATH execute.sub')

Note that the submitted job might fail if you leave out variables that are important to it.

Another option is to manipulate the environment (os.environ) in the submitting Python script before running the os.system command and to restore it afterwards.
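
For example, a minimal sketch of that approach (the OMPI_/MPI_ prefixes and the execute.sub file name are taken from above; adapt as needed):

import os

# Save and remove the MPI-related variables inherited from the outer mpirun,
# so the submitted job does not see them.
saved = {k: os.environ.pop(k) for k in list(os.environ)
         if k.startswith(("OMPI_", "MPI_"))}
try:
    os.system("sbatch execute.sub")  # submitted with the cleaned environment
finally:
    os.environ.update(saved)  # restore the variables for the rest of Main.py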

huangapple
  • Posted on July 13, 2023 at 00:52:24
  • When reproducing this article, please keep the original link: https://go.coder-hub.com/76672866.html