How to use trap in my sbatch bash job script in Compute Canada?
Question
I am using the SLURM_TMPDIR in ComputeCanada to do some intensive I/O operations, like cloning large repositories, analyzing their commit histories, etc. But now, when the job runs out of the assigned time, I lose my output file inside SLURM_TMPDIR. I read about signal trapping here. But since I am not that experienced in system programming, maybe my understanding is not very accurate, and hence I can't achieve what I intend to. Here is my batch job script, but it doesn't trap the signal and copy the output to my desired location.
#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --signal=B:SIGUSR1@120
output_file_name=file_0000.jsonl
echo "Start"
function handle_signal()
{
    echo 'Moving File'
    cp $SLURM_TMPDIR/<output_file_path> <my_compute_canada_directory>
    exit 2
}
trap 'handle_signal' SIGUSR1
cd $SLURM_TMPDIR
git clone ...
cd ...
module purge
module load java/17.0.2
module load python/3.10
export JAVA_TOOL_OPTIONS="-Xms256m -Xmx5g"
python -m venv res_venv
source res_venv/bin/activate
pip install -r requirements.txt
python data_collector.py ./data/file_0000.csv $output_file_name
wait
echo "Test"
exit 0
But it doesn't even print 'Moving File'. Can someone please guide me on how to use a signal trap efficiently with SLURM_TMPDIR? It should copy the specified file if the job runs out of the assigned time, and also once my Python script finishes executing. Thanks!
Answer 1
Score: 2
It seems that you need to be running srun for the signal to be sent:
Outside srun:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --signal=B:SIGUSR1@50
trap 'echo SIGUSR1 1>&2' SIGUSR1
srun sleep 1
dd if=/dev/zero of=/dev/null 2>/dev/null
Result (no SIGUSR1 line appears, so the trap never ran):
slurmstepd: error: *** JOB 25752715 ON node-2017 CANCELLED AT 2023-05-27T23:47:06 DUE TO TIME LIMIT ***
During srun:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --signal=B:SIGUSR1@50
trap 'echo SIGUSR1 1>&2' SIGUSR1
srun dd if=/dev/zero of=/dev/null 2>/dev/null
Result (the trap ran and echoed SIGUSR1):
slurmstepd: error: *** JOB 25752755 ON node-2014 CANCELLED AT 2023-05-28T00:01:06 DUE TO TIME LIMIT ***
SIGUSR1
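Applied to the script in the question, a minimal sketch along these lines should do what was asked. It is an illustration, not a tested drop-in: it assumes the output file ends up directly under $SLURM_TMPDIR, and the <my_compute_canada_directory> placeholder and the elided clone/module/venv steps from the original still have to be filled in. The long-running Python step is launched with srun, as observed above, and additionally put in the background with & so that the batch shell sits in wait. bash does not run a trap while a foreground command is still running, but a trapped signal does interrupt the wait builtin, which is likely why the original script never printed 'Moving File'. The file is copied both from the trap handler and after normal completion.
#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --signal=B:SIGUSR1@120    # ask Slurm to send SIGUSR1 to the batch shell 120 s before the limit
output_file_name=file_0000.jsonl
dest_dir="<my_compute_canada_directory>"   # placeholder from the question; replace with the real destination
# Copy the (possibly partial) output back to permanent storage.
copy_output()
{
    echo 'Moving File'
    cp "$SLURM_TMPDIR/$output_file_name" "$dest_dir/"
}
# On SIGUSR1 (time limit approaching), save the output and stop.
trap 'copy_output; exit 2' SIGUSR1
cd "$SLURM_TMPDIR"
# ... git clone, module load, virtualenv setup and pip install as in the original script ...
# Run the long step under srun and in the background, then wait for it;
# --ntasks=1 keeps it to a single copy despite --ntasks-per-node=8, and the
# wait builtin returns early when the trapped signal arrives.
srun --ntasks=1 python data_collector.py ./data/file_0000.csv "$output_file_name" &
wait
# Normal completion: copy the finished file as well.
copy_output
Whether srun is strictly required here may depend on the cluster's Slurm configuration, so treat the sketch as a starting point rather than a definitive fix; the backgrounding-plus-wait part is the standard bash pattern for letting a trap run while a long command is still executing.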