How to tell Julia to use more than one node when starting a job on a cluster?

Question

I am running a computation in Julia on our cluster, which uses the Slurm workload manager.

The jobs are started using a Slurm batch script of the form:

#SBATCH --partition=partition
#SBATCH --nodes=2

export JULIA_NUM_THREADS=16

julia --optimize=3 --compile=all --threads=16 --project myScript.jl

In myScript.jl I use the Distributed.jl package to add multiple workers, each meant to use 16 threads (set by the export statement above), to perform some parallel computation:

using Distributed

const Nworkers = 10
addprocs(Nworkers - 1)

@sync for n in 1:Nworkers
    @spawnat :any longComputation()
end

This works well on a single node, but when I request more than one node, only one of them is actually used.

How can I get Julia to use all available resources in such a batch job?

Answer 1

Score: 1

While I don't use multithreading, I do use Distributed with Julia across multiple nodes on a SLURM cluster daily. I think the key problem here is how you are spawning your workers. Under the hood, SlurmManager uses srun when creating the procs, which is what lets it place workers on multiple nodes. When you just call addprocs(10), Julia starts all workers on the node where the main script is running and knows nothing about the other nodes in the allocation. Try adding procs with:

addprocs(SlurmManager())

This will add as many procs as you have tasks in your SBATCH settings. For me that is 256:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
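
A minimal sketch of what the Julia side could look like, for illustration only. It assumes the zero-argument SlurmManager() comes from the SlurmClusterManager.jl package (the answer does not name the package) and that the script is started inside an sbatch allocation. The body of longComputation is a placeholder that reports where it ran, which makes it easy to check that more than one node is in use.

```julia
# Sketch only: assumes SlurmClusterManager.jl is installed and that this
# script runs inside a Slurm allocation (e.g. started from an sbatch script).
using Distributed
using SlurmClusterManager   # assumed provider of the zero-argument SlurmManager()

# Start one worker per Slurm task, spread over all allocated nodes.
addprocs(SlurmManager())

# Placeholder workload: report the host name and worker id.
# @everywhere makes the definition available on every process.
@everywhere longComputation() = (gethostname(), myid())

# Run one call on each worker and collect the results.
futures = map(workers()) do w
    @spawnat w longComputation()
end

for (host, id) in fetch.(futures)
    println("worker $id ran on $host")
end
```

If the printed host names include every node of the allocation, the workers are distributed as intended.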

Like you, I also want to remove one proc so that the node hosting the main process stays slightly underutilized. You can simply call rmprocs(2) (which removes only worker number 2) after all of the workers have been initialized. However, I noticed that you are still running a loop over 1:Nworkers instead of 1:(Nworkers-1).
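
The removal step and the loop can be sketched as follows. To keep the example runnable without a cluster, it uses three local workers as a stand-in for addprocs(SlurmManager()); that substitution and the placeholder longComputation are assumptions made for illustration. Iterating over workers() instead of a hard-coded range keeps the loop in step with the workers that actually exist.

```julia
using Distributed

# Local stand-in for addprocs(SlurmManager()); in a fresh session the
# new workers get ids 2, 3 and 4.
addprocs(3)

# Remove worker 2 so one slot stays free for the main process.
rmprocs(2)

# Placeholder workload, defined on all remaining processes.
@everywhere longComputation() = myid()

# Schedule exactly one call per remaining worker.
futures = map(workers()) do w
    @spawnat w longComputation()
end
results = fetch.(futures)

println("remaining workers: ", workers())
println("results: ", results)
```

Each entry in results is the id of the worker that ran the call, so it should match workers().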

To use multithreading in the workers created via addprocs(SlurmManager()), I suspect you will need to make sure that --threads=16 is passed to the workers. You may be able to do that directly through the usual exeflags keyword of addprocs (see the docs).
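
A small sketch of the exeflags idea. It uses local workers so that it runs without a cluster, and two threads instead of 16 to keep it light. Whether addprocs(SlurmManager(); exeflags=...) forwards the flag in the same way is an assumption that should be checked on the cluster by running the same query there.

```julia
using Distributed

# Local stand-in for addprocs(SlurmManager(); exeflags=...).
# exeflags is a standard addprocs keyword; each worker is started with these flags.
addprocs(2; exeflags=`--threads=2`)

# Ask every worker how many threads it was started with.
threads_per_worker = [remotecall_fetch(Threads.nthreads, w) for w in workers()]

println("threads per worker: ", threads_per_worker)
```

On the cluster, the flag would be `--threads=16`, and the same remotecall_fetch query shows whether the workers picked it up.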

Posted by huangapple on July 11, 2023, 00:34:40
Source: https://go.coder-hub.com/76655704.html