Running an Independent SLURM Job with mpirun Inside a Python Script - Recursive Call Error
Question
I'm currently using a Python script that requires MPI to operate. This script runs on a SLURM system.
In order to run my Python script, I define the number of nodes to use and launch the following command within my sbatch submission file:
#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH --nodes=12
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 1 --report-bindings --bind-to none -oversubscribe python Main.py
This setup works fine. However, I now want to introduce an additional task within the Python script that requires running another instance of mpirun. Since this task needs to run on different nodes, I decided to submit a new sbatch job that runs mpirun, using the following Python command:
os.system(f"sbatch execute.sub")
The associated execute.sub submission file is designed as follows:
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 60 -oversubscribe --bind-to none other_application
However, when I attempt this, I encounter an error message: "mpirun does not support recursive calls". I'm confused by this, because I was under the impression that I was simply submitting a standalone job, independent of any other operations.
Could anyone please help me understand what's going wrong and how to correct it? Thanks in advance.
Answer 1
Score: 1
The problem is that the MPI and Open MPI environment variables are propagated from the initial mpirun command to the nested one through os.system and sbatch.
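If you want to verify that this is what is happening, a quick check (my own illustration, not part of the original answer) is to print the inherited MPI-related variables from inside Main.py before submitting the second job; the OMPI_ and MPI_ prefixes are the same ones targeted by the unset line below:

import os

# List every environment variable inherited from the outer mpirun launch
# whose name starts with one of the MPI-related prefixes. Depending on the
# installation you may also see PMIX_ or SLURM_ variables.
for name, value in sorted(os.environ.items()):
    if name.startswith(("OMPI_", "MPI_")):
        print(f"{name}={value}")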
You can remove them in Bash with
unset "${!OMPI_@}" "${!MPI_@}"
That line can be placed before the mpirun command in execute.sub, or included in the os.system call like this:
os.system('unset "${!OMPI_@}" "${!MPI_@}" ; sbatch execute.sub')
Alternatively, you can also use the --export parameter of sbatch to keep only the variables you need:
os.system("sbatch --export=PATH,LD_LIBRARY_PATH execute.sub")
Note that your subprocess might fail if you forget variables that are important to it.
Another option is to manipulate the environment through os.environ in the submitting Python script before running the os.system command, and to restore it afterwards.
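A minimal sketch of that last approach, assuming the same OMPI_/MPI_ prefixes as the unset line above (illustration only, not code from the original answer):

import os

# Save the current environment, strip the MPI-related variables so they are
# not inherited by sbatch, submit the independent job, then restore the
# original environment for the rest of the MPI-driven Python script.
saved_env = dict(os.environ)
for name in list(os.environ):
    if name.startswith(("OMPI_", "MPI_")):
        del os.environ[name]

os.system("sbatch execute.sub")

os.environ.clear()
os.environ.update(saved_env)

Because the environment is restored right after the call, the surrounding mpirun-launched script keeps working with its original variables.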