Running an Independent SLURM Job with mpirun Inside a Python Script - Recursive Call Error
Question
I'm currently using a Python script that requires MPI to operate. This script runs on a SLURM system.
In order to run my Python script, I define the number of nodes to use and launch the following command within my sbatch submission file:
#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH --nodes=12
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 1 --report-bindings --bind-to none -oversubscribe python Main.py
This setup works fine. However, I now want to introduce an additional task within the Python script that requires running another instance of mpirun. Since this task needs to run on different nodes, I decided to submit a new sbatch job that runs mpirun, using the following Python command:
os.system(f"sbatch execute.sub")
The associated execute.sub submission file is designed as follows:
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --qos=qosname
#SBATCH --account=accountname
#SBATCH --partition=partitionname
mpirun -np 60 -oversubscribe --bind-to none other_application
However, when I attempt this, I encounter an error message: "mpirun does not support recursive calls". I'm confused by this, because I was under the impression that I was simply submitting a standalone job, independent of any other operations.
Could anyone please help me understand what's going wrong and how to correct it? Thanks in advance.
Answer 1
Score: 1
The problem is that the MPI and Open MPI environment variables are propagated from the initial mpirun command to the nested one through os.system and sbatch.
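If you want to verify that this is what is happening, a quick check (my own illustration, not part of the original answer) is to print the inherited MPI-related variables from inside Main.py before submitting the second job; the OMPI_ and MPI_ prefixes are the same ones targeted by the unset line below:

import os

# List every environment variable inherited from the outer mpirun launch
# whose name starts with one of the MPI-related prefixes. Depending on the
# installation you may also see PMIX_ or SLURM_ variables.
for name, value in sorted(os.environ.items()):
    if name.startswith(("OMPI_", "MPI_")):
        print(f"{name}={value}")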
You can remove them in Bash with
unset "${!OMPI_@}" "${!MPI_@}"
That line can be placed before the mpirun command in execute.sub, or included in the os.system call like this:
os.system('unset "${!OMPI_@}" "${!MPI_@}" ; sbatch execute.sub')
Alternatively, you can also use the --export parameter of sbatch to keep only the variables you need:
os.system("sbatch --export=PATH,LD_LIBRARY_PATH execute.sub")
Note that your subprocess might fail if you forget variables that are important to it.
Another option is to manipulate the environment through os.environ in the submitting Python script before running the os.system command, and to restore it afterwards.
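A minimal sketch of that last approach, assuming the same OMPI_/MPI_ prefixes as the unset line above (illustration only, not code from the original answer):

import os

# Save the current environment, strip the MPI-related variables so they are
# not inherited by sbatch, submit the independent job, then restore the
# original environment for the rest of the MPI-driven Python script.
saved_env = dict(os.environ)
for name in list(os.environ):
    if name.startswith(("OMPI_", "MPI_")):
        del os.environ[name]

os.system("sbatch execute.sub")

os.environ.clear()
os.environ.update(saved_env)

Because the environment is restored right after the call, the surrounding mpirun-launched script keeps working with its original variables.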