How to tell Julia to utilize more than one node when starting a job on a cluster?

Question

I am running a computation in Julia on our cluster, which is managed by Slurm.

The jobs are started using a Slurm batch script of the form:

#!/bin/bash
#SBATCH --partition=partition
#SBATCH --nodes=2

export JULIA_NUM_THREADS=16

julia --optimize=3 --compile=all --threads=16 --project myScript.jl

In myScript.jl I now use the Distributed.jl package to add multiple workers, each utilizing 16 threads (set by the export statement above), to perform some parallel computation:

using Distributed

const Nworkers = 10
addprocs(Nworkers - 1)  # main process + 9 workers = 10 processes

@sync for n in 1:Nworkers
    @spawnat :any longComputation()  # run on an automatically chosen worker
end

This works well when utilizing only a single node, but when requesting more than one node, only one of them is actually utilized.

How can I get Julia to use all available resources in such a batch call?

Answer 1

Score: 1

While I don't use multithreading, I do use Distributed with Julia across multiple nodes on a SLURM cluster daily. I think the key problem here is how you are spawning your workers. Under the hood, SlurmManager uses srun when creating the procs, so it can place workers across multiple nodes. A plain addprocs(10) knows nothing about any node other than the one the main script is running on. Try adding procs with:

addprocs(SlurmManager())

This will add as many procs as you have tasks in your SBATCH allocation. For me that is 256:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
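
Put together, a minimal sketch of the launch step. One assumption on my part: the no-argument SlurmManager() constructor used here matches the SlurmClusterManager.jl package, which reads the task count from Slurm's environment variables (ClusterManagers.jl also exports a SlurmManager, but its constructor takes an explicit process count):

using Distributed
using SlurmClusterManager  # assumed package providing the no-argument SlurmManager()

# Launches one worker per Slurm task via srun, so with the allocation
# above this adds 256 workers spread across the 4 nodes.
addprocs(SlurmManager())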

Like you, I also want to remove one proc so that the node hosting the main process is not oversubscribed. You can simply call rmprocs(2) (to remove just worker number 2) after you initialize all of the workers. Note, however, that your loop still runs over 1:Nworkers instead of 1:(Nworkers-1).
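
For example (a sketch, with a placeholder standing in for your real longComputation):

using Distributed, SlurmClusterManager  # SlurmClusterManager is an assumption, as above

addprocs(SlurmManager())
rmprocs(2)  # drop worker 2 so the hosting node keeps some headroom

@everywhere longComputation() = sum(rand(10^8))  # placeholder workload

# Iterating over workers() spawns exactly one task per remaining
# worker, sidestepping the 1:Nworkers off-by-one.
@sync for w in workers()
    @spawnat w longComputation()
end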

To use multithreading in the workers generated via addprocs(SlurmManager()), I suspect you will need to make sure that --threads=16 is passed to the workers. You may be able to do that directly via the usual addprocs exeflags handling (see docs).
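
For instance (a sketch, relying on the standard exeflags keyword that addprocs forwards to the worker launch command):

using Distributed, SlurmClusterManager  # package assumed, as above

# Start each worker's julia process with 16 threads.
addprocs(SlurmManager(); exeflags="--threads=16")

# Sanity check: report the thread count on every worker.
println(fetch.([@spawnat w Threads.nthreads() for w in workers()]))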
