Snakemake syntax for multiple outputs with the use of checkpoint
Question
I'm using Snakemake to build a pipeline. I have a checkpoint that should produce multiple output files. These output files are later used in my rule all within expand. The problem is that I don't know how many files will be produced, so I can't specify the values to pass to expand.
The files are produced by an R script.
Example:
rule all:
    input:
        expand(["results/{output}"],
               output=????)

checkpoint rscript:
    input:
        "foo.input"
    output:
        report("somedir/{output}"),
    script:
        "../scripts/foo.R"
Of course this is only a small part of the workflow. I basically have a loop in my R script that writes multiple files into somedir. Since I don't know how many there will be, and their names are only determined inside the R script, I can't set output in expand.
Maybe this is a trivial question to some of you, or even a stupid one, and there are better ways to do this. If that's the case I'd still be thankful, because I've had trouble understanding most of the Snakemake functions from the English documentation.
If there are more questions I'll gladly answer them. (The best case for me would be to give the output files names that I can specify at runtime within the R script.)
(I also can't aggregate the created files in another rule, because each file shows a different plot.)
Edit: The main problem still seems to be that checkpoint rscript is not able to create multiple {output} files in "somedir/". The attempt with touch("rscript_finish.flag") seems to write only the SVG file as "rscript_finish.flag", or to overwrite "rscript_finish.flag" each time the loop in my R script writes to snakemake@output[[1]].
Answer 1
Score: 2
There are no stupid questions :). I hope I understood, and it was actually not a trivial question at all!
def all_input(wildcards):
    # Make sure checkpoint rscript has run; Snakemake re-evaluates this function afterwards.
    checkpoints.rscript.get()
    # Find all the output files that rscript wrote into somedir/.
    filenames, = glob_wildcards("somedir/{filenames}.png")
    # Request one copy per plot; the ".png" suffix must match the output of add_to_report.
    return expand("somedir_cp/{fn}.png", fn=filenames)

rule all:
    input:
        all_input

rule add_to_report:
    input:
        "somedir/{filename}.png"
    output:
        report("somedir_cp/{filename}.png")
    shell:
        "cp {input} {output}"

checkpoint rscript:
    input:
        "foo.input"
    output:
        touch("rscript_finish.flag")
    script:
        "../scripts/foo.R"
I didn't really test the code, so I am not sure whether it works immediately, but I think the logic is correct.
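To check the logic of all_input outside Snakemake, here is a minimal plain-Python sketch of the same idea: list the plots that exist in a directory after the script has run, then map each name to the copy that should be requested. The helper name discover_targets is hypothetical, and pathlib is only a simplified stand-in for glob_wildcards and expand (for instance, it does not look into subdirectories).

```python
from pathlib import Path
import tempfile


def discover_targets(src_dir, dst_dir, ext=".png"):
    """Map every <src_dir>/<name><ext> that exists to <dst_dir>/<name><ext>.

    Hypothetical helper: mimics glob_wildcards + expand with the stdlib only.
    """
    # Collect the base names of all matching files (top level of src_dir only).
    names = sorted(p.stem for p in Path(src_dir).glob(f"*{ext}"))
    # Build one target path per discovered name, keeping the extension.
    return [f"{dst_dir}/{name}{ext}" for name in names]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        somedir = Path(tmp) / "somedir"
        somedir.mkdir()
        # Pretend the R script produced two plots and one unrelated file.
        for name in ("plot_a", "plot_b"):
            (somedir / f"{name}.png").touch()
        (somedir / "notes.txt").touch()
        print(discover_targets(somedir, "somedir_cp"))
```

The point of the sketch is that the target list is computed only from files that already exist, which is why the lookup has to happen after the checkpoint has finished.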
The way to solve this is with an extra rule, which I called add_to_report. All this rule does is make a copy of an existing output file of rscript and add it to the report. The input function of rule all first requests the execution of checkpoint rscript. Once the checkpoint has run, the function finds all the files it generated. It then tells rule all that it needs, as input, a copy of each file rscript generated. Those copies are made by rule add_to_report, and thus the files are added to the report.
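Regarding the edit in the question: one alternative, following the pattern in the Snakemake documentation on checkpoints, is to declare the directory itself as the checkpoint output with directory() instead of a flag file. Then snakemake@output[[1]] is the directory path rather than a single file that each loop iteration overwrites. The following is an untested sketch under that assumption. It assumes foo.R writes its .png plots into that directory, and that rule all and rule add_to_report stay as shown in this answer.

```snakemake
import os

checkpoint rscript:
    input:
        "foo.input"
    output:
        # The whole directory is the output; the file names are not known in advance.
        directory("somedir")
    script:
        "../scripts/foo.R"

def all_input(wildcards):
    # Wait for the checkpoint, then read back the directory it produced.
    outdir = checkpoints.rscript.get().output[0]
    # Collect the names of all plots found in that directory.
    names = glob_wildcards(os.path.join(outdir, "{name}.png")).name
    # Request one report copy per plot (made by rule add_to_report).
    return expand("somedir_cp/{name}.png", name=names)
```

With this layout the R script would build each file path from the directory in snakemake@output[[1]] and can choose the file names at runtime, which is what the question asked for.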