nextflow: avoid work directory being generated despite output folder
Question
I have the following code, which generates out.txt in a path I choose: results_new/check
The script runs fine, that is, out.txt is generated successfully. However, another out.txt is generated at:
/Users/name/Documents/user/nextflow_scripts/test/work/55/5d31c87911f64f9060cf4560ef381c/out.txt
params.hg38genome = "/Users/sariys01/Downloads/NM.fasta";
params.outdir = './results_new/';

process create_file {
    output:
    publishDir "${params.outdir}/check", mode: 'copy'
    path("out.txt"), emit: json

    script:
    """
    echo "hello\n"
    touch out.txt
    echo "$reads\n"
    """
}

workflow {
    create_file(params.hg38genome).view()
}
How do I avoid this work folder being generated?
Answer 1
Score: 2
You can't, and shouldn't try to, avoid this out.txt being generated.
This follows from how Nextflow works. Roughly, when a process is started, a working directory is created. All the files listed in the input: block of that process are linked into that directory, and then the code within the process is run. So, from the point of view of that code, everything happens in a perfectly normal directory full of files, and any output it creates is written into this same working directory. Then, when the process finishes, the files listed in the output: block are exported as appropriate. In your case, that means making a copy of out.txt in your publishDir.
So, by construction, the file out.txt can only be exported to the publishDir if it has first been generated in the working directory.
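The sequence described above can be imitated by hand in plain shell. This is only a rough simulation under stated assumptions, not Nextflow's actual implementation: the nf_demo directory, the ab/123456 task directory name and the stand-in NM.fasta content are all made up for illustration.

```shell
set -eu

# Hypothetical directory names, chosen only for this illustration.
demo=nf_demo
task_dir="$demo/work/ab/123456"          # stands in for work/<hash prefix>/<hash>
publish_dir="$demo/results_new/check"    # stands in for the publishDir
rm -rf "$demo"
mkdir -p "$task_dir" "$publish_dir" "$demo/inputs"

# A stand-in input file.
printf '>seq1\nACGT\n' > "$demo/inputs/NM.fasta"

# 1. Stage the input: link it into the task directory.
ln -s "$(pwd)/$demo/inputs/NM.fasta" "$task_dir/NM.fasta"

# 2. Run the task script from inside the task directory.
(
  cd "$task_dir"
  ls -1 NM.fasta
  touch out.txt
)

# 3. Publish the declared output as a copy (like mode: 'copy').
cp "$task_dir/out.txt" "$publish_dir/out.txt"

# out.txt now exists in both places.
ls "$task_dir" "$publish_dir"
```

Step 3 has nothing to copy unless step 2 has already written out.txt into the task directory, which is why the copy under work/ cannot be skipped.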
Now, the question is why you would want to avoid it being generated there. Typically, this happens in the background and you shouldn't even need to think about it in daily Nextflow usage. If storage is the concern, note that you have two useful options.
First, the working directory itself is created at the path defined in the environment variable NXF_WORK (by default, ./work in the launch directory). You can change this environment variable, or pass the -w (-work-dir) option to nextflow run, if you want to store it somewhere else.
Second, you can easily delete all these "temporary" files after the pipeline has run, using the following command (the -f flag confirms the deletion; use -n instead for a dry run):
nextflow clean -f
But you don't want to do that systematically, because these files can be useful! Before a process is run, Nextflow first checks whether the corresponding results already exist. If so, there is no need to re-run the process, and the existing results are simply reused. You enable this by adding the -resume option:
nextflow run <script.nf> -resume
However, if you delete the working directory, the results are no longer available and you can't resume execution.
This can be seen easily with this script:
process create_file {
    output:
    path("out.txt")

    script:
    """
    sleep 10
    echo "blabla" > out.txt
    """
}

workflow {
    create_file().view()
}
Execute it a first time with:
nextflow run test.nf
It takes 10 seconds to run. Then, if you run:
nextflow run test.nf -resume
it finishes immediately (you can see that the working directory name stays the same, and the task is reported as cached). If you delete the file out.txt from the working directory and re-run, it will take 10 seconds again.
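The reuse-or-rerun behaviour can be sketched in a few lines of shell. This is a deliberately simplified illustration, not Nextflow's real caching logic: the nf_resume_demo directory, the run_task function and the use of cksum to name the task directory are assumptions made for the example.

```shell
set -eu

demo=nf_resume_demo                 # hypothetical scratch directory
script='echo "blabla" > out.txt'    # stands in for the process script

# Derive the task directory name from the script text, so the same
# script always maps to the same directory.
key=$(printf '%s' "$script" | cksum | cut -d' ' -f1)
task_dir="$demo/work/$key"

run_task() {
  # Reuse the result only if the output file is still present and the
  # previous run recorded a successful exit code.
  if [ -f "$task_dir/out.txt" ] && [ "$(cat "$task_dir/.exitcode" 2>/dev/null)" = "0" ]; then
    echo "cached"
  else
    mkdir -p "$task_dir"
    ( cd "$task_dir" && sh -c "$script" )
    echo 0 > "$task_dir/.exitcode"
    echo "executed"
  fi
}

rm -rf "$demo"
first=$(run_task)          # nothing exists yet, so the task runs
second=$(run_task)         # the output is found, so it is reused
rm "$task_dir/out.txt"     # remove the stored result
third=$(run_task)          # the output is gone, so the task runs again
echo "$first $second $third"
```

The point of the sketch is that the stored output in the work directory is what makes a later run skippable; once it is removed, the task has to be executed again.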
Answer 2
Score: 2
This is not something that you want to do. Nextflow processes are intended to run independently of, and isolated from, each other, each inside its own working directory. On a shared filesystem you can of course read from and write to files outside of this directory, but this would not be possible if you later decided to use, for example, the AWS Batch or Google Cloud executors. So, to ensure your workflow is portable and can run in the cloud or locally, only ever read files that have been staged into your process working directory (as defined in your input block) and avoid writing to files outside of it.
Note that the publishDir directive is entirely optional. If you're just starting out with Nextflow, you can ignore it until you're ready to decide which files you would like your workflow to publish. Note also that only files declared in the output block can be published to the publishDir.
params.hg38genome = '/Users/name/Downloads/NM.fasta'
params.outdir = './results_new/'

process create_file {
    publishDir "${params.outdir}/check", mode: 'copy'
    debug true

    input:
    path fasta

    output:
    path "out.txt"

    script:
    """
    echo "staged files:"
    ls -1 "${fasta}"
    touch out.txt
    """
}

workflow {
    hg38genome = file( params.hg38genome )
    create_file( hg38genome )
    create_file.out.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 23.04.1
Launching `main.nf` [sad_franklin] DSL2 - revision: d6d4c2b069
executor > local (1)
[44/354fa4] process > create_file [100%] 1 of 1 ✔
/path/to/work/44/354fa4771aba0090a74332c0a414ad/out.txt
staged files:
NM.fasta
The 'process working directory' in this example is:
/path/to/work/44/354fa4771aba0090a74332c0a414ad
Note also that Nextflow creates a number of 'dot' files inside this directory:
$ ls -ga --time-style='+' /path/to/work/44/354fa4771aba0090a74332c0a414ad/
total 32
drwxr-xr-x 2 users 4096 .
drwxr-xr-x 3 users 4096 ..
-rw-r--r-- 1 users 0 .command.begin
-rw-r--r-- 1 users 0 .command.err
-rw-r--r-- 1 users 23 .command.log
-rw-r--r-- 1 users 23 .command.out
-rw-r--r-- 1 users 3132 .command.run
-rw-r--r-- 1 users 69 .command.sh
-rw-r--r-- 1 users 1 .exitcode
lrwxrwxrwx 1 users 73 NM.fasta -> /Users/name/Downloads/NM.fasta
-rw-r--r-- 1 users 0 out.txt