Nextflow on GCP - waiting on container error
Question
I'm running a pipeline using Nextflow on Google Batch. However, I'm getting the following error:
ERROR ~ Error executing process > 'PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)'
Caused by:
Process `PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)` terminated with an error exit status (null)
Command executed:
mkdir output
nlrexpress.py \
--input All_Candidate_Soybean_Prots_Simplified_Sorted.fasta \
--outdir ./output \
--module all
mv output/*.short.output.txt ./
Command exit status:
null
Command output:
15/06/2023 15:36:31: ############ NLRexpress started ############
15/06/2023 15:36:31: Input FASTA: All_Candidate_Soybean_Prots_Simplified_Sorted.fasta
15/06/2023 15:36:31: Checking FASTA file - started
15/06/2023 15:36:31: Checking FASTA file - done
15/06/2023 15:36:31: Running JackHMMER - started
Command error:
time="2023-06-15T15:39:22Z" level=error msg="error waiting for container: "
Work dir:
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
The module nf file is here:
process NLREXPRESS {
    tag "$sample_id"
    maxForks 1
    container = 'dthorbur1990/nlrexpress:latest'
    cpus { 4 * task.attempt }
    memory { 12.GB * task.attempt }
    disk "15.GB"

    publishDir(
        path: "${params.PlantDir}",
        mode: 'copy',
    )

    input:
    tuple val(sample_id), path(peptides)

    output:
    path "*.short.output.txt", emit: nlre_out

    script:
    """
    mkdir output
    nlrexpress.py \\
        --input ${peptides} \\
        --outdir ./output \\
        --module ${params.NE_Modules}
    mv output/*.short.output.txt ./
    """
}
The process ran without error when I ran it locally, and I have rebuilt the container and confirmed that it works as intended.
What confuses me is that the workDir doesn't contain the .command.{out,err} files, suggesting (to me at least) that it's not running. But the Command output section of the error message shows the correct first few lines of the tool's output.
Here is the workDir:
gsutil ls gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.begin
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.run
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.sh
And here is the end of the log file regarding the NLREXPRESS module:
All_Candidate_Soybean_Prots_Simplified_Sorted)","q3Label":"PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)"},"writes":null},{"cpuUsage":null,"process":"ORIENTATION","mem":null,"memUsage":null,"timeUsage":null,"vmem":null,"reads":null,"cpu":null,"time":null,"writes":null}]
I'm at a loss. I've tried increasing memory, but that doesn't seem to have helped. Any ideas? Happy to add the nextflow.log file if that would be helpful.
Answer 1
Score: 1
I'm not sure if I have an answer for you, but I think this behavior might have something to do with how Nextflow runs the job. If you look at the end of the nxf_main function in the .command.run script, you'll see something like:
nxf_main() {
    ...
    set +e
    ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
    local cout=$ctmp/.command.out; mkfifo $cout
    local cerr=$ctmp/.command.err; mkfifo $cerr
    tee .command.out < $cout &
    tee1=$!
    tee .command.err < $cerr >&2 &
    tee2=$!
    ( nxf_launch ) >$cout 2>$cerr &
    pid=$!
    wait $pid || nxf_main_ret=$?
    wait $tee1 $tee2
    nxf_unstage
}
When errexit is enabled (set -e), any command that returns a non-zero exit status immediately terminates the script. The set +e above explicitly disables this behavior, so the script carries on even if one of the steps that follow fails. This means that .command.out and .command.err may not necessarily be created, even though the Docker container is run (via nxf_launch).
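The capture pattern in that excerpt can be tried outside Nextflow. The following is only a minimal sketch of the same idea (two named pipes, each drained by a background tee), not the actual Nextflow code; demo_capture, its workdir argument, and the FIFO file names are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the FIFO + tee capture pattern seen in .command.run.
# demo_capture is a hypothetical helper, not part of Nextflow.
# Usage: demo_capture <workdir> <command> [args...]
demo_capture() (
    # The function body is a subshell, so set +e and the variables
    # below do not leak into the calling shell.
    workdir=$1; shift
    ret=0
    set +e                                  # a failing step no longer aborts the script
    ctmp=$(mktemp -d) || exit 1             # scratch directory for the named pipes
    cout=$ctmp/out.fifo
    cerr=$ctmp/err.fifo
    mkfifo "$cout" "$cerr" || exit 1
    # Each tee copies one stream into a file in workdir and passes it through.
    tee "$workdir/.command.out" < "$cout" &
    tee1=$!
    tee "$workdir/.command.err" < "$cerr" >&2 &
    tee2=$!
    # Run the task with stdout and stderr connected to the pipes.
    ( "$@" ) > "$cout" 2> "$cerr" &
    pid=$!
    wait "$pid" || ret=$?                   # remember the task's exit status
    wait "$tee1" "$tee2"                    # let both tee processes drain the pipes
    rm -rf "$ctmp"
    exit "$ret"
)
```

In this sketch the two files only come into existence once the tee processes start, and they are written to the local directory passed as workdir; nothing here copies them anywhere else.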
So I wonder if there is a problem with the size of /dev/shm? Google Cloud Batch supports the containerOptions directive, so as a first step you could try increasing the shm-size by adding something like this to your process definition:
process NLREXPRESS {
    container 'dthorbur1990/nlrexpress:latest'
    containerOptions '--shm-size 2g'
    ...
}
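To see whether the option took effect, one possibility is to print the size of /dev/shm at the top of the process script so that it shows up in the task output. A minimal sketch, assuming df and awk are available in the container; shm_report is a hypothetical helper name, not something Nextflow provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report total and available space (in KiB) for a mount.
# Defaults to /dev/shm; relies only on POSIX df output (-P) in 1024-byte blocks (-k).
shm_report() {
    local mount=${1:-/dev/shm}
    # Line 2 of `df -P` holds: filesystem, total, used, available, capacity, mount point.
    df -Pk "$mount" | awk 'NR == 2 { printf "%s total_kb=%s avail_kb=%s\n", $6, $2, $4 }'
}
```

Calling shm_report with no argument before nlrexpress.py starts would log the shared-memory size the container actually received.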
Note also the typo where you set the container directive. I'm not sure whether it will cause problems, but the = character should be avoided here. The assignment syntax using the = character is, however, required elsewhere, for example inside your nextflow.config.