Snakemake syntax for multiple outputs with the use of checkpoint

Question

I'm using Snakemake to build a pipeline. I have a checkpoint that should produce multiple output files. These output files are later used in my rule all within expand. The problem is that I don't know the number of files that will be produced, and therefore can't list them in expand.

The files will be produced in an R script.

Example:

rule all:
    input:
        expand(["results/{output}"],
               output=????)


checkpoint rscript:
    input:
        "foo.input"
    output:
        report("somedir/{output}"),
    script:
        "../scripts/foo.R" 

Of course this is only a small part, but I basically have a loop in my R script that writes multiple files into somedir. Since I don't know how many there will be, and because their names are only determined inside the R script, I can't set output in expand.

Maybe this is a really trivial question to some of you, or even a stupid question, and there are better ways to do this. If that's the case I'd still be thankful, because I had problems understanding most of the Snakemake functions due to my limited English.

If there are more questions I'd gladly answer. (The best case for me would be to let output have names that I could specify at runtime within the R script.)

(I also can't aggregate the created files in another rule because each file will show a different plot)

Edit: The main problem still seems to be that checkpoint rscript is not able to create multiple {output} files in "somedir/". The attempt with touch("rscript_finish.flag") seems to output only the SVG file as "rscript_finish.flag", or seems to overwrite "rscript_finish.flag" each time the loop in my R script writes to snakemake@output[[1]].

Answer 1

Score: 2

There are no stupid questions :). I hope I understood, and it was actually not a trivial question at all!

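def all_input(wildcards):
    # Requesting the checkpoint makes Snakemake run rscript first and
    # re-evaluate this function once it has finished
    checkpoints.rscript.get()
    # Find all the .png files that rscript wrote into somedir/
    filenames, = glob_wildcards("somedir/{filenames}.png")
    # The .png suffix is needed so the targets match add_to_report's output
    return expand("somedir_cp/{fn}.png", fn=filenames)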


rule all:
    input:
        all_input


rule add_to_report:
    input:
        "somedir/{filename}.png"
    output:
        report("somedir_cp/{filename}.png")
    shell:
        "cp {input} {output}"


checkpoint rscript:
    input:
        "foo.input"
    output:
        touch("rscript_finish.flag")
    script:
        "../scripts/foo.R"

I didn't really test the code, so I am not sure whether it works right away, but I think the logic is correct.

The way to solve this is with an extra rule, which I called add_to_report. All this rule does is copy each existing output file of rscript and add the copy to the report. rule all gets its input from the function all_input, which first requests the execution of checkpoint rscript. Once the checkpoint has run, the function finds all the files it generated. It then declares that rule all needs a copy of each of those files as input; rule add_to_report makes those copies, and thus the files are added to the report.
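
To make the discovery step concrete, here is a minimal plain-Python sketch of what all_input computes once the checkpoint has finished. It only imitates glob_wildcards and expand with the standard library and is not how Snakemake implements them. The helper names discover_stems and report_targets and the example file names are made up for illustration, and the sketch assumes all plots sit directly in somedir/ as .png files.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_stems(plot_dir: Path, suffix: str = ".png") -> list[str]:
    """Hypothetical stand-in for glob_wildcards("somedir/{filenames}.png")
    on a flat directory: return each matching file name without its suffix."""
    return sorted(p.name[: -len(suffix)] for p in plot_dir.glob(f"*{suffix}"))


def report_targets(stems: list[str], out_dir: str = "somedir_cp",
                   suffix: str = ".png") -> list[str]:
    """Hypothetical stand-in for expand("somedir_cp/{fn}.png", fn=stems)."""
    return [f"{out_dir}/{stem}{suffix}" for stem in stems]


# Simulate the state of the working directory after the checkpoint has run:
# the script wrote an unknown number of plots into somedir/.
with tempfile.TemporaryDirectory() as tmp:
    somedir = Path(tmp) / "somedir"
    somedir.mkdir()
    for name in ("plot_a.png", "plot_b.png", "notes.txt"):
        (somedir / name).touch()

    stems = discover_stems(somedir)   # only the .png files are picked up
    targets = report_targets(stems)   # one requested copy per plot
    print(targets)
```

Each target keeps the .png extension, so it can match the output pattern "somedir_cp/{filename}.png" of add_to_report; a target without the extension would not be produced by that rule.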

huangapple
  • Posted on January 7, 2020, 02:09:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59616894.html