How to submit parallel (Python) SLURM jobs with arguments in a for loop from tcsh?

Question

I have a Python script. I want to run 4 slightly different versions of it in a for loop (only one variable difference in the script). When each of those finishes I want to run the same scripts but with another variable changed, because in the next loop, they would use HDF5 outputs generated by the previous scripts.

My underlying problem is that I am not very knowledgeable when it comes to tcsh or bash. I need to run it from tcsh because I pass the environment variables and loaded modules on to the Python script (and those are set up in the .cshrc file).

The other problem is that until now I submitted the jobs to SLURM one by one with the syntax ./python_script1.py > output1.out. (I'm not interested in the output1.out file itself, but it's nice to have.) I found many similar solutions, but all of them use the srun command in the for loop.

I spent a couple of hours on this, scrambled together a very basic bash script, and tried to run it in tcsh, where it failed with a bunch of "command not found" errors. I understand the syntax is not quite the same. Relevant lines:

#!/bin/tcsh
#SBATCH --job-name=looptest    ## Name of the job
#SBATCH --output=looptest.out  ## Output file
#SBATCH --get-user-env
OUTPUT = file

for i in `seq 1 3`; do
  for j in `seq 1 3`; do
    srun \
      -N1 \
      --cpus-per-task=48 \
      ./slurmtest.py $i $j > "$OUTPUT_$i_$j.out" &
  done
done

wait

I understand that I can get the arguments after the script name in Python with sys.argv[i].

Now, the relevant parts of the SLURM script that I used to run before I started experimenting look like this:

#!/bin/tcsh
#SBATCH --job-name=job1_1
#SBATCH --get-user-env
./script.py > output1_1.out

Then I would manually change the two values in the Python script, and change the job name and output name, before every run. Some of the insides of my ideal Python script would look like this:

import os
import sys


continue_from_number = sys.argv[1]
folder_name = 'folder_' + str(sys.argv[2]) + '/'

for filename in os.listdir(folder_name):
    if filename.startswith("output_" + str(continue_from_number) + "_"):
        oname = filename

# Do some things with 'oname', then
# save an output file with str(continue_from_number + 1) in its name
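
For illustration, that last step would be something like the sketch below (sys.argv entries are strings, so the number has to go through int() before it can be incremented; the output filename pattern is just a placeholder):

# continuation of the snippet above: sys.argv values are strings,
# so convert before incrementing the step number
next_number = int(continue_from_number) + 1

# ... do some things with 'oname', then save the result under a name
# that contains the incremented number (the pattern below is a placeholder)
new_oname = "output_" + str(next_number) + "_result.h5"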

Ideally, I would like to run multiple of these scripts parallel (only argument #2 changed between them) in a for loop in a way that each loop waits for the jobs in the previous loop to finish, otherwise they will have no input to work with. Do I have to use the syntax --dependency=afterok<jobID1:jobID2:jobID3:jobID4>?
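
If I understand correctly, that would mean something like the following sketch submitted from tcsh (the step*_var*.sh batch scripts are just placeholders, and --parsable makes sbatch print only the job ID):

#!/bin/tcsh
# submit the four variants of one step and capture their job IDs
set jid1 = `sbatch --parsable step1_var1.sh`
set jid2 = `sbatch --parsable step1_var2.sh`
set jid3 = `sbatch --parsable step1_var3.sh`
set jid4 = `sbatch --parsable step1_var4.sh`

# the next step may only start once all four jobs above finished successfully
# (and similarly for the other variants of step 2)
sbatch --dependency=afterok:${jid1}:${jid2}:${jid3}:${jid4} step2_var1.sh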

I could write the loop in Python but that's not an option since the runtime of my scripts is close to the job time limit on the cluster I'm running them on.

If I have to use srun that is fine but I would like to stay in tcsh if that's possible.

Answer 1

Score: 0

I eventually figured it out. I used a bash script after all, which I called from tcsh with sbatch slurm_script.sh; the tcsh environment variables were passed to the script this way (thanks @yut23). The slurm_script.sh looks like this:

#!/bin/bash
#SBATCH --job-name=cont_loop
#SBATCH --output=cont_loop.out
#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# outer loop: steps 1-4; each step uses the HDF5 output of the previous one
for i in `seq 1 4`; do
  # inner loop: the two parameter variants of the current step, run in parallel
  for j in 10 11; do
    ./cont_loop.py $i $j > "${j}_cont_loop_${i}.out" &
  done
  # block until both variants of step $i have finished before starting step $i+1
  wait
done

It's important to set the --ntasks flag to the number of jobs you want to run simultaneously; in this example it's 2, because the inner for loop over j runs only twice in each outer iteration.

(Also, when testing without the outer loop, I tried setting i = 1, which doesn't work in bash because of the spaces around the equals sign; just use i=1 instead.)
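
For completeness, the submission from the tcsh login shell is simply:

sbatch slurm_script.sh

By default sbatch copies the submitting shell's environment into the job (the equivalent of --export=ALL), which is why the variables and modules set up in .cshrc are available inside the bash batch script.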
