How to make Snakemake run a rule once for all matching outputs, not once for each wildcard match

Question

In a Snakemake workflow, for efficiency reasons, I want to run a rule once for the list of wildcard matches - rather than once for each match.

What's the idiomatic way of doing this in Snakemake?

This is minimal starter code that does not do what I want: it would call rule produce_all_csv once for each of the required outputs (here 3 times) rather than the desired single time.

rule all:
    input:
        "out_1.csv",
        "out_2.csv",
        "out_3.csv",


rule produce_all_csv:
    """
    This rule should be called _once_ for _all_ samples
    Not once per sample
    """
    input:
        "in_{sample}.csv",
    output:
        "out_{sample}.csv",
    shell:
        """
        # Placeholder for a real command
        # that takes a list of input files
        # and produces a list of output files
        """

For concreteness, assume the tool has this API:

tool --inputs input_1.csv,input_2.csv --outputs output_1.csv,output_2.csv

This question is inspired by https://stackoverflow.com/questions/75603548/how-to-escape-missingoutputexception-while-running-a-for-loop-in-a-rule-in-snake

Answer 1

Score: 1

What about this?

SAMPLES = ['1', '2', '3']

rule all:
    input:
        "out_1.csv",
        "out_2.csv",
        "out_3.csv",


rule produce_all_csv:
    input:
        csv=[f"in_{sample}.csv" for sample in SAMPLES],
    output:
        csv=[f"out_{sample}.csv" for sample in SAMPLES],
    params:
        in_csv=lambda wc, input: ','.join(input.csv),
        out_csv=lambda wc, output: ','.join(output.csv),
    shell:
        r"""
        tool --inputs {params.in_csv} --outputs {params.out_csv}
        """

You could probably use the expand function instead of the list comprehensions.
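
For reference, here is a sketch of the same rule written with expand. It assumes the same SAMPLES list and the placeholder tool command line from the question, and has not been run against a real tool. Neither the input nor the output of produce_all_csv contains a wildcard, so Snakemake should schedule it as a single job that produces all three files:

```snakemake
SAMPLES = ["1", "2", "3"]


rule all:
    input:
        # Request every per-sample output; all of them come from one job below.
        expand("out_{sample}.csv", sample=SAMPLES),


rule produce_all_csv:
    input:
        # expand() returns a plain list of concrete paths, so no wildcard is left.
        csv=expand("in_{sample}.csv", sample=SAMPLES),
    output:
        csv=expand("out_{sample}.csv", sample=SAMPLES),
    params:
        # The placeholder tool expects comma-separated lists, while {input} and
        # {output} would be space-separated, hence the explicit join.
        in_csv=lambda wildcards, input: ",".join(input.csv),
        out_csv=lambda wildcards, output: ",".join(output.csv),
    shell:
        "tool --inputs {params.in_csv} --outputs {params.out_csv}"
```

A dry run with snakemake -n should list produce_all_csv once in the job counts, which is a quick way to check that the rule is not being run per sample.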

Posted by huangapple on 2023-03-08 17:00:24
Source: https://go.coder-hub.com/75671070.html