How to use trap in my sbatch bash job script in Compute Canada?
Question
I am using the SLURM_TMPDIR in ComputeCanada to do some intensive I/O operations, like cloning large repositories, analyzing their commit histories, etc. But now, when the job runs out of the assigned time, I lose my output file inside SLURM_TMPDIR. I read about signal trapping here. But since I am not that experienced in system programming, maybe my understanding is not very accurate, and hence I can't achieve what I intend to. Here is my batch job script, but it doesn't trap the signal and copy the output to my desired location.
#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --signal=B:SIGUSR1@120
output_file_name=file_0000.jsonl
echo "Start"
function handle_signal()
{
    echo 'Moving File'
    cp $SLURM_TMPDIR/<output_file_path> <my_compute_canada_directory>
    exit 2
}
trap 'handle_signal' SIGUSR1
cd $SLURM_TMPDIR
git clone ...
cd ...
module purge
module load java/17.0.2
module load python/3.10
export JAVA_TOOL_OPTIONS="-Xms256m -Xmx5g"
python -m venv res_venv
source res_venv/bin/activate
pip install -r requirements.txt
python data_collector.py ./data/file_0000.csv $output_file_name
wait
echo "Test"
exit 0
But it doesn't even print 'Moving File'. Can someone please guide me on how to use a signal trap efficiently with SLURM_TMPDIR? It should copy the specified file if the job runs out of the assigned time, and also once my Python script finishes executing. Thanks!
Answer 1
Score: 2
It seems that you need to be running srun for the signal to be sent:
Outside srun:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --signal=B:SIGUSR1@50
trap 'echo SIGUSR1 1>&2' SIGUSR1
srun sleep 1
dd if=/dev/zero of=/dev/null 2>/dev/null
Result (no SIGUSR1 line appears, so the trap never ran):
slurmstepd: error: *** JOB 25752715 ON node-2017 CANCELLED AT 2023-05-27T23:47:06 DUE TO TIME LIMIT ***
During srun:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --signal=B:SIGUSR1@50
trap 'echo SIGUSR1 1>&2' SIGUSR1
srun dd if=/dev/zero of=/dev/null 2>/dev/null
Result (the trap ran and echoed SIGUSR1):
slurmstepd: error: *** JOB 25752755 ON node-2014 CANCELLED AT 2023-05-28T00:01:06 DUE TO TIME LIMIT ***
SIGUSR1
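Applied to the script in the question, a minimal sketch along these lines should do what was asked. It is an illustration, not a tested drop-in: it assumes the output file ends up directly under $SLURM_TMPDIR, and the <my_compute_canada_directory> placeholder and the elided clone/module/venv steps from the original still have to be filled in. The long-running Python step is launched with srun, as observed above, and additionally put in the background with & so that the batch shell sits in wait. bash does not run a trap while a foreground command is still running, but a trapped signal does interrupt the wait builtin, which is likely why the original script never printed 'Moving File'. The file is copied both from the trap handler and after normal completion.
#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --signal=B:SIGUSR1@120    # ask Slurm to send SIGUSR1 to the batch shell 120 s before the limit
output_file_name=file_0000.jsonl
dest_dir="<my_compute_canada_directory>"   # placeholder from the question; replace with the real destination
# Copy the (possibly partial) output back to permanent storage.
copy_output()
{
    echo 'Moving File'
    cp "$SLURM_TMPDIR/$output_file_name" "$dest_dir/"
}
# On SIGUSR1 (time limit approaching), save the output and stop.
trap 'copy_output; exit 2' SIGUSR1
cd "$SLURM_TMPDIR"
# ... git clone, module load, virtualenv setup and pip install as in the original script ...
# Run the long step under srun and in the background, then wait for it;
# --ntasks=1 keeps it to a single copy despite --ntasks-per-node=8, and the
# wait builtin returns early when the trapped signal arrives.
srun --ntasks=1 python data_collector.py ./data/file_0000.csv "$output_file_name" &
wait
# Normal completion: copy the finished file as well.
copy_output
Whether srun is strictly required here may depend on the cluster's Slurm configuration, so treat the sketch as a starting point rather than a definitive fix; the backgrounding-plus-wait part is the standard bash pattern for letting a trap run while a long command is still executing.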