How to use trap in my sbatch bash job script in Compute Canada?

Question

I am using SLURM_TMPDIR on Compute Canada for I/O-intensive work such as cloning large repositories and analyzing their commit histories. When the job runs out of its assigned time, I lose the output file inside SLURM_TMPDIR. I have read about signal trapping, but since I am not very experienced in systems programming, my understanding may be inaccurate, and I can't achieve what I intend. Here is my batch job script; it does not trap the signal and copy the output to my desired location.

#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0   
#SBATCH --signal=B:SIGUSR1@120

output_file_name=file_0000.jsonl
echo "Start"

function handle_signal() 
{
    echo 'Moving File'
    cp $SLURM_TMPDIR/<output_file_path> <my_compute_canada_directory>
    exit 2
}

trap 'handle_signal' SIGUSR1


cd $SLURM_TMPDIR
git clone ...

cd ...

module purge

module load java/17.0.2
module load python/3.10

export JAVA_TOOL_OPTIONS="-Xms256m -Xmx5g"

python -m venv res_venv
source .venv/bin/activate
pip install -r requirements.txt

python data_collector.py ./data/file_0000.csv $output_file_name

wait

echo "Test"

exit 0

But it doesn't even print 'Moving File'. Can someone please guide me on how to use a signal trap effectively with SLURM_TMPDIR? The script should copy the specified file if the job runs out of its assigned time, and it should also copy the file once my Python script finishes. Thanks!

Answer 1

Score: 2

It seems that you need to be running srun for the signal to be sent:

Outside srun:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00   
#SBATCH --signal=B:SIGUSR1@50

trap 'echo SIGUSR1 1>&2' SIGUSR1

srun sleep 1
dd if=/dev/zero of=/dev/null 2>/dev/null

Result:

slurmstepd: error: *** JOB 25752715 ON node-2017 CANCELLED AT 2023-05-27T23:47:06 DUE TO TIME LIMIT ***

During srun:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00   
#SBATCH --signal=B:SIGUSR1@50

trap 'echo SIGUSR1 1>&2' SIGUSR1

srun dd if=/dev/zero of=/dev/null 2>/dev/null

Result:

slurmstepd: error: *** JOB 25752755 ON node-2014 CANCELLED AT 2023-05-28T00:01:06 DUE TO TIME LIMIT ***
SIGUSR1
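
One possible reading of the second result, where SIGUSR1 is printed only after the cancellation message: bash does not run a trap handler while it is waiting for a foreground command; the handler runs once that command returns. When bash is instead blocked in the wait builtin on a background job, a trapped signal makes wait return immediately with a status above 128, and the handler runs right away. In the question's script the python command runs in the foreground, so the later wait has nothing left to wait for. The sketch below illustrates the background-plus-wait pattern locally, without SLURM. It is not taken from the answer, and everything in it is a stand-in: scratch_dir plays the role of $SLURM_TMPDIR, dest_dir the permanent directory, the sleep the real workload, and the delayed kill the signal that `--signal=B:SIGUSR1@120` would deliver to the batch shell.

```shell
#!/bin/bash
# Sketch only: simulates the pre-timeout signal locally, without SLURM.
# scratch_dir and dest_dir are hypothetical stand-ins for $SLURM_TMPDIR
# and for a permanent storage directory.
scratch_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
signal_seen=0

save_output() {
    # Copy whatever has been written so far out of scratch space.
    cp "$scratch_dir/out.txt" "$dest_dir/"
}

handle_signal() {
    signal_seen=1
    save_output
}
trap handle_signal USR1

# Stand-in for the long-running workload. Starting it with '&' is what
# lets the shell react to the signal while the workload is still running.
( echo partial > "$scratch_dir/out.txt"; exec sleep 30 ) &
work_pid=$!

# Stand-in for SLURM signalling the batch shell before the time limit.
# $$ still expands to the main shell's PID inside the subshell.
( sleep 1; kill -USR1 $$ ) &

# 'wait' returns early with a status above 128 when a trapped signal
# arrives, and the handler runs immediately afterwards.
wait "$work_pid"
wait_status=$?

# Clean up the stand-in workload.
kill "$work_pid" 2>/dev/null
wait "$work_pid" 2>/dev/null

echo "signal_seen=$signal_seen wait_status=$wait_status"
```

In a real job script the same shape would apply: start the workload with `&`, call `wait` on it, and call the same copy function after `wait` returns normally, so the output is also saved when the workload finishes on its own. Whether 120 seconds is enough time for the copy depends on the file size and the filesystem.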

huangapple
  • Posted on May 28, 2023, 02:06:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76348342.html