How to correctly define multiple sets of values as config values in Snakemake

Question

I have a Snakefile that loops over a set of isolates to run two Python scripts. I want to pass several sets of values as the script flags, but I do not know how to do that without a Bash script.

import os
grouphome = os.environ['GROUPHOME']
ISOLATES = [i for i in open(grouphome+'/isolates.txt').read().split('\n') if len(i) > 0]
window_size=config["window_size"]
slide_size=config["slide_size"]
mean_filter_size =config["mean_filter_size"]
rule all:
    input:
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv", isolate=ISOLATES, window_size=config["window_size"], slide_size=config["slide_size"], mean_filter_size=config["mean_filter_size"]),
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv", isolate=ISOLATES, window_size=config["window_size"], slide_size=config["slide_size"], mean_filter_size=config["mean_filter_size"]),
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv", isolate=ISOLATES, window_size=config["window_size"], slide_size=config["slide_size"], mean_filter_size=config["mean_filter_size"])
rule coverage_preprocessing:
    input:
        input1= grouphome+"/data/{isolate}.fasta",
        input2= grouphome+"/max_coverage/{isolate}_maximum_coverage.csv"
    params:
        window_size=config["window_size"],
        slide_size=config["slide_size"],
        mean_filter_size =config["mean_filter_size"]
    output: grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
    shell:
        """./preprocess_coverage.py -i {input.input2} -g {input.input1} -sw {params.slide_size} -ws {params.window_size} -m mean -bm mean -bs {params.mean_filter_size} -o {output}"""
rule calc_bptr:
    params:
        window_size=config["window_size"],
        slide_size=config["slide_size"],
        mean_filter_size =config["mean_filter_size"]
    input:
        input1= grouphome+"/data/{isolate}.fasta",
        input2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
    output:
        output1= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv",
        output2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv",
    shell:
        """./calculate_bptr.py -i {input.input2} -b mean -s {params.mean_filter_size} -g {input.input1} -f {output.output2} -o {output.output1}"""

I am using the following command to run it:
snakemake --config window_size=1000000 slide_size=10000 mean_filter_size=99 --cores=9

How do I set multiple sets of values for --config?
For example, 1000000, 10000, 99
and 1000000, 100, 66, and so on.

Answer 1

Score: 1

There are multiple layers to this problem.

First, I would prefer passing a config file to the snakemake call, e.g. snakemake --configfile config.yaml, with config.yaml containing the values for the parameters. There, you can specify a list of values for each parameter, such as:

window_size:
  - 1000
  - 2000

slide_size:
  - 5
  - 10

The second problem is how to run a rule with a list of parameter values.
I would suggest requesting the outputs for all parameter values in rule all, with the downstream rules taking their parameter values from wildcards.

# This rule requests all combinations of the two parameters, which are passed as lists from the config.
rule all:
    input:
        expand(
            "results/{window_size}_{slide_size}.csv",
            window_size=config["window_size"],
            slide_size=config["slide_size"]
        ),

# This rule parses the parameter values from the requested output path; they are accessed through the wildcards object.
rule preprocess_coverage:
    input:
        ...,
    output: 
        "results/{window_size}_{slide_size}.csv"
    shell:
        "... -sw {wildcards.slide_size} -ws {wildcards.window_size} -o {output}"

Using the example config.yaml from above, the first rule, rule all, requests the following files: results/1000_5.csv, results/1000_10.csv, results/2000_5.csv and results/2000_10.csv. Snakemake recognizes that the rule preprocess_coverage can produce these files by setting the wildcard values, which are then accessed in the shell directive of the rule as parameters for the script.
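To make the two steps above concrete, here is a minimal plain-Python sketch of the idea. It is not Snakemake's own code: expand_product and match_wildcards are hypothetical helpers written only for illustration, the config dict is a stand-in for what Snakemake would load from config.yaml, and the ".+" used for each wildcard is an assumption about the default matching behaviour.

```python
import itertools
import re

# Hypothetical stand-in for the dict Snakemake builds from config.yaml.
config = {"window_size": [1000, 2000], "slide_size": [5, 10]}
pattern = "results/{window_size}_{slide_size}.csv"


def expand_product(pattern, **wildcards):
    """Fill the pattern with every combination of the given values."""
    names = list(wildcards)
    combos = itertools.product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


def match_wildcards(pattern, path):
    """Recover wildcard values from a path, or return None if it does not fit."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)          # literal text of the path
        else:
            regex += f"(?P<{part}>.+)"        # one named group per wildcard
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


# What "rule all" asks for, and what the producing rule sees as wildcards.
targets = expand_product(pattern, **config)
resolved = [match_wildcards(pattern, target) for target in targets]

for target, values in zip(targets, resolved):
    print(target, values)
```

Note that the recovered values are strings, which is why they can be dropped straight into the shell command as {wildcards.window_size} and {wildcards.slide_size}.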

See the Snakemake documentation for more about wildcards and about the expand function.

In the expand function you can provide a different combining function to control which combinations are produced, e.g. to produce only results/1000_5.csv and results/2000_10.csv, use the zip function:

rule all:
    input:
        expand(
            "results/{window_size}_{slide_size}.csv",
            zip,
            window_size=config["window_size"],
            slide_size=config["slide_size"]
        ),
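Applied to the three parameters from the question, the pairing idea might look like the plain-Python sketch below. This is only an illustration of what zipping does, not the Snakemake API: the isolate names are placeholders, the config dict is a hypothetical stand-in for config.yaml, and the path is written relative rather than under grouphome.

```python
# Hypothetical config: position i of every list belongs to parameter set i.
config = {
    "window_size": [1000000, 1000000],
    "slide_size": [10000, 100],
    "mean_filter_size": [99, 66],
}
isolates = ["isolateA", "isolateB"]  # placeholder isolate names

pattern = (
    "preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}"
    "/{isolate}.csv"
)

# Python's zip silently truncates to the shortest list, so check lengths first.
lengths = {len(values) for values in config.values()}
assert len(lengths) == 1, "every parameter needs one value per set"

# Pair the parameters by position: (1000000, 10000, 99) and (1000000, 100, 66).
parameter_sets = list(
    zip(config["window_size"], config["slide_size"], config["mean_filter_size"])
)

# Each parameter set is combined with every isolate.
targets = [
    pattern.format(window_size=ws, slide_size=ss, mean_filter_size=mf, isolate=isolate)
    for ws, ss, mf in parameter_sets
    for isolate in isolates
]

for target in targets:
    print(target)
```

The parameter sets stay paired while the isolates are still fully crossed with them, so two sets and two isolates give four targets rather than the sixteen a full product of all four lists would request.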

With Snakemake 5.31 and above, there is another option: parameter space exploration, which is well described in the Snakemake manual. Paramspace follows the same approach as above, but with less code. I would recommend trying the approach above first, because understanding it is important for understanding how Paramspace works.
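As a rough picture of what a parameter space does, the sketch below emulates it in plain Python without importing Snakemake. It assumes, as far as I understand the default behaviour, that each row of a parameter table becomes a path made of name~value segments; the table contents and variable names are hypothetical, and the real class may differ in details.

```python
import csv
import io

# Hypothetical parameter table; in practice this would live in a TSV file.
table = (
    "window_size\tslide_size\tmean_filter_size\n"
    "1000000\t10000\t99\n"
    "1000000\t100\t66\n"
)

reader = csv.DictReader(io.StringIO(table), delimiter="\t")
rows = list(reader)            # one dict of strings per parameter set
columns = reader.fieldnames    # parameter names, in table order

# One path segment per parameter, e.g. "window_size~{window_size}".
wildcard_pattern = "/".join(f"{name}~{{{name}}}" for name in columns)

# One concrete path per row of the table.
instance_patterns = [wildcard_pattern.format(**row) for row in rows]

print(wildcard_pattern)
for instance in instance_patterns:
    print(instance)
```

Each row of the table is one parameter set, which matches the "1000000, 10000, 99" and "1000000, 100, 66" sets from the question without listing the values in three separate lists.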

huangapple
  • Posted on June 8, 2023, 06:47:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427537.html