How to make Snakemake run a rule once for all matching outputs, not once for each wildcard match
Question
In a Snakemake workflow, for efficiency reasons, I want to run a rule once for the list of wildcard matches - rather than once for each match.
What's the idiomatic way of doing this in Snakemake?
This is minimal starter code that does not do what I want: it would call rule produce_all_csv once for each of the required outputs (here, 3 times) rather than the desired single time.
rule all:
    input:
        "out_1.csv",
        "out_2.csv",
        "out_3.csv",

rule produce_all_csv:
    """
    This rule should be called _once_ for _all_ samples
    Not once per sample
    """
    input:
        "in_{sample}.csv",
    output:
        "out_{sample}.csv",
    shell:
        """
        # Placeholder for a real command
        # that takes a list of input files
        # and produces a list of output files
        """
For concreteness, assume the tool has this API:
tool --inputs input_1.csv,input_2.csv --outputs output_1.csv,output_2.csv
This question is inspired by https://stackoverflow.com/questions/75603548/how-to-escape-missingoutputexception-while-running-a-for-loop-in-a-rule-in-snake
Answer 1
Score: 1
What about this?
SAMPLES = ['1', '2', '3']

rule all:
    input:
        "out_1.csv",
        "out_2.csv",
        "out_3.csv",

rule produce_all_csv:
    input:
        csv=[f"in_{sample}.csv" for sample in SAMPLES],
    output:
        csv=[f"out_{sample}.csv" for sample in SAMPLES],
    params:
        in_csv=lambda wc, input: ','.join(input.csv),
        out_csv=lambda wc, output: ','.join(output.csv),
    shell:
        r"""
        tool --inputs {params.in_csv} --outputs {params.out_csv}
        """
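For reference, here is a plain-Python sketch of the strings the two params lambdas should produce for the three samples. It only mirrors the list comprehensions and the ','.join calls from the rule, outside of Snakemake. The helper name build_tool_command is hypothetical, and tool is the placeholder CLI from the question.

```python
# Sample IDs, as in the Snakefile above.
SAMPLES = ["1", "2", "3"]


def build_tool_command(samples):
    """Assemble the command line the rule's shell section would expand to.

    Hypothetical helper for illustration only; `tool` is the placeholder
    CLI assumed in the question, not a real program.
    """
    # Same file lists as the rule's input and output sections.
    inputs = [f"in_{sample}.csv" for sample in samples]
    outputs = [f"out_{sample}.csv" for sample in samples]
    # Same comma-joining as the params lambdas.
    in_csv = ",".join(inputs)
    out_csv = ",".join(outputs)
    return f"tool --inputs {in_csv} --outputs {out_csv}"


command = build_tool_command(SAMPLES)
print(command)
```

Because the rule's output section lists concrete file names with no wildcards left in them, Snakemake treats the rule as one job that creates all of the outputs, so this command is issued once.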
You could probably use the expand function instead of the list comprehensions.
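A minimal sketch of that variant, assuming the same SAMPLES list and the placeholder tool CLI from the question. Inside a Snakefile, expand is available without an import and returns the list of file names obtained by filling in the sample placeholder, so it can stand in for each list comprehension directly. This is untested against a real tool.

```snakemake
# Sketch only: expand() builds the same file lists as the list comprehensions.
SAMPLES = ["1", "2", "3"]

rule all:
    input:
        expand("out_{sample}.csv", sample=SAMPLES),

# After expand(), no wildcards remain in input or output, so Snakemake
# schedules produce_all_csv as a single job that creates every output.
rule produce_all_csv:
    input:
        csv=expand("in_{sample}.csv", sample=SAMPLES),
    output:
        csv=expand("out_{sample}.csv", sample=SAMPLES),
    params:
        in_csv=lambda wc, input: ",".join(input.csv),
        out_csv=lambda wc, output: ",".join(output.csv),
    shell:
        "tool --inputs {params.in_csv} --outputs {params.out_csv}"
```

Using expand in rule all as well keeps the list of targets tied to SAMPLES, so adding a sample only requires changing one line.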