June 15, 2023 06:55:07

nextflow: avoid work directory being generated despite output folder

Question

I have the following code, which generates an out.txt in a path I choose: results_new/check
The script runs fine, in that out.txt is generated successfully; however, another out.txt is also generated at:

/Users/name/Documents/user/nextflow_scripts/test/work/55/5d31c87911f64f9060cf4560ef381c/out.txt

params.hg38genome = "/Users/sariys01/Downloads/NM.fasta";
params.outdir = './results_new/';

process create_file {

output:
publishDir "${params.outdir}/check", mode: 'copy'
path("out.txt"), emit: json

script:
"""
echo "hello\n"
touch out.txt
echo "$reads\n"
"""
}

workflow {
create_file(params.hg38genome).view()
}

How do I avoid this work folder being generated?

Answer 1

Score: 2

You can't (and shouldn't try to) prevent this out.txt from being generated.

This follows from how Nextflow works. Roughly, when a process is started, a working directory is created for it. All the files listed in the input: block of that process are linked into that directory, and then the code within the process is run. From the point of view of that code, everything happens in a perfectly normal directory full of files, and any output it creates lands in that same directory (the working directory). When the process finishes, the files listed in the output: block are exported as appropriate. In your case, that means making a copy of out.txt in your publishDir.

So, by construction, the file out.txt can only be exported to the publishDir if it has been generated in the working directory.
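The lifecycle described above can be imitated by hand with ordinary shell commands. This is only a rough sketch of the idea, not what Nextflow actually executes: the temporary directory, the shortened task directory name, and the tiny FASTA file are all made up for illustration.

```shell
# Hand-rolled imitation of one task's lifecycle (illustrative paths only).
demo=$(mktemp -d)
input="$demo/NM.fasta"
task_dir="$demo/work/55/5d31c8"          # stands in for work/<hash>
publish_dir="$demo/results_new/check"    # stands in for publishDir

printf '>seq1\nACGT\n' > "$input"

# 1. A fresh working directory is created for the task.
mkdir -p "$task_dir"

# 2. The declared input is linked into it, not copied.
ln -s "$input" "$task_dir/NM.fasta"

# 3. The script block runs with the working directory as its current directory.
( cd "$task_dir" && touch out.txt )

# 4. The declared output is copied to the publish directory (mode: 'copy').
mkdir -p "$publish_dir"
cp "$task_dir/out.txt" "$publish_dir/out.txt"

ls "$task_dir" "$publish_dir"
```

Step 4 has nothing to copy unless step 3 has already left out.txt in the working directory, which is why the two copies always appear together.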

Now, the question is why you would want to avoid it being generated there. Typically this happens in the background, and you shouldn't even need to think about it in daily Nextflow usage. If storage is the concern, note that you have two useful options.

First, the working directory itself is created at the path defined in the environment variable NXF_WORK. You can change this environment variable if you want to store it somewhere else.
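As a small sketch of this first option, the variable can be exported in the shell before launching the pipeline. The directory chosen here is a hypothetical example; any writable location with enough space would do.

```shell
# Hypothetical location for the work directory; adjust to your own storage.
export NXF_WORK="${TMPDIR:-/tmp}/nextflow-work"
mkdir -p "$NXF_WORK"
echo "task directories will be created under: $NXF_WORK"

# Then launch the pipeline as usual from the same shell, for example:
#   nextflow run test.nf
# The -w (-work-dir) option of `nextflow run` sets the location for a single run:
#   nextflow run test.nf -w "$NXF_WORK"
```

Note that this only moves the working directories elsewhere; they are still created, and the published out.txt is still a copy of the one produced there.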

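Second, you can easily delete all these "temporary" files after the pipeline has run, using the command below (use -n instead of -f to preview what would be removed without deleting anything):

nextflow clean -f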

But you don't want to do that systematically; these files can be useful! Before a process is run, Nextflow first checks whether the corresponding results already exist. If they do, there is no need to re-run the process, and the existing results are simply reused. You enable this with:

nextflow run -resume

However, if you delete the working directory, then the results are not available anymore, and you can't resume execution.

This can be seen easily with this script:

process create_file {
    output:
    path("out.txt")
    script:
    """
    sleep 10
    echo "blabla" > out.txt
    """
}
workflow {
    create_file().view()
}

Execute it a first time with

nextflow run test.nf

It takes 10 seconds to run. Then, if you run

nextflow run test.nf -resume

it finishes immediately (you will notice that the working directory name stays the same, and you get a message saying the result was cached). If you delete the out.txt file from the working directory and re-run, it will take 10 seconds again.
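The behaviour can be mimicked with a toy model in plain shell. This is a deliberate simplification and not Nextflow's real caching logic: here the directory name is derived from the script text alone using cksum, whereas Nextflow's task hash takes more into account, and the names run_task and cache_root are invented for this sketch.

```shell
# Toy model of resuming: reuse a result if it is still in the work directory.
cache_root=$(mktemp -d)
script='echo "blabla" > out.txt'

# Derive a directory name from the script text (a stand-in for the task hash).
key=$(printf '%s' "$script" | cksum | cut -d' ' -f1)
task_dir="$cache_root/work/$key"

run_task() {
    if [ -f "$task_dir/out.txt" ]; then
        echo "cached"                      # result already present: skip the work
    else
        mkdir -p "$task_dir"
        ( cd "$task_dir" && sh -c "$script" )
        echo "executed"
    fi
}

first=$(run_task)     # nothing in the work directory yet
second=$(run_task)    # out.txt is still there, so it is reused
rm "$task_dir/out.txt"
third=$(run_task)     # the output is gone, so the work is redone
echo "$first $second $third"
```

The point of the sketch is only that the files in the work directory are what make the shortcut possible: once they are removed, the work has to be done again.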

Answer 2

Score: 2

This is not something that you want to do. Nextflow processes are intended to run independently, isolated from one another, each inside its own working directory. On a shared filesystem you can read from and write to files outside of this directory, but that would not be possible if you later decided to use the AWS Batch or Google Cloud executors, for example. So, to ensure your workflow is portable and can be run in the cloud or locally, make sure to only ever read from files that have been staged into your process working directory (as defined in your input block), and avoid writing to files outside of the process working directory.

Note that the publishDir directive is entirely optional. If you're just starting out with Nextflow, you can ignore the publishDir directive until you're ready to decide which files you would like your workflow to publish. Note also that only files declared in the output block can be published to the publishDir.

params.hg38genome = '/Users/name/Downloads/NM.fasta'
params.outdir = './results_new/'
process create_file {
    publishDir "${params.outdir}/check", mode: 'copy'
    debug true
    input:
    path fasta
    output:
    path "out.txt"
    script:
    """
    echo "staged files:"
    ls -1 "${fasta}"
    touch out.txt
    """
}
workflow {
    hg38genome = file( params.hg38genome )
    create_file( hg38genome )
    create_file.out.view()
}

Results:

$ nextflow run main.nf 
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [sad_franklin] DSL2 - revision: d6d4c2b069
executor >  local (1)
[44/354fa4] process > create_file [100%] 1 of 1 ✔
/path/to/work/44/354fa4771aba0090a74332c0a414ad/out.txt
staged files:
NM.fasta

The 'process working directory' in this example is:

/path/to/work/44/354fa4771aba0090a74332c0a414ad

Note also that Nextflow creates a number of 'dot' files inside this directory:

$ ls -ga --time-style='+' /path/to/work/44/354fa4771aba0090a74332c0a414ad/
total 32
drwxr-xr-x 2 users 4096  .
drwxr-xr-x 3 users 4096  ..
-rw-r--r-- 1 users    0  .command.begin
-rw-r--r-- 1 users    0  .command.err
-rw-r--r-- 1 users   23  .command.log
-rw-r--r-- 1 users   23  .command.out
-rw-r--r-- 1 users 3132  .command.run
-rw-r--r-- 1 users   69  .command.sh
-rw-r--r-- 1 users    1  .exitcode
lrwxrwxrwx 1 users   73  NM.fasta -> /Users/name/Downloads/NM.fasta
-rw-r--r-- 1 users    0  out.txt
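These 'dot' files are handy when debugging a task: .command.sh holds the script as it was actually run, .command.err holds its standard error, and .exitcode holds its exit status. A small helper along the following lines can print them for a given task directory. The function name inspect_task is made up for this sketch, and the path in the last line is the example directory from the listing above.

```shell
# Print the most useful bookkeeping files of one task directory, if present.
inspect_task() {
    dir=$1
    for f in .command.sh .exitcode .command.err; do
        if [ -f "$dir/$f" ]; then
            printf '== %s ==\n' "$f"
            cat "$dir/$f"
            printf '\n'
        fi
    done
}

# Usage with the task directory from the example (prints nothing if it does not exist):
inspect_task /path/to/work/44/354fa4771aba0090a74332c0a414ad
```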
