June 8, 2023 06:47:09

How to correctly define multiple sets of values as config values in snakemake

Question

I have a Snakefile that loops over a set of isolates and runs two Python scripts. I want to pass several sets of values as the script flags, but I do not know how to do it without a Bash script.

import os
grouphome = os.environ['GROUPHOME']
ISOLATES = [i for i in open(grouphome+'/isolates.txt').read().split('\n') if len(i) > 0]
window_size=config["window_size"]
slide_size=config["slide_size"]
mean_filter_size=config["mean_filter_size"]
rule all:
    input:
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv", isolate=ISOLATES,window_size=config["window_size"],slide_size=config["slide_size"],mean_filter_size=config["mean_filter_size"]),
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv", isolate=ISOLATES,window_size=config["window_size"],slide_size=config["slide_size"],mean_filter_size=config["mean_filter_size"]),
        expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv",isolate=ISOLATES,window_size=config["window_size"],slide_size=config["slide_size"],mean_filter_size=config["mean_filter_size"])
rule coverage_preprocessing:
    input:
        input1= grouphome+"/data/{isolate}.fasta",
        input2= grouphome+"/max_coverage/{isolate}_maximum_coverage.csv"
    params:
        window_size=config["window_size"],
        slide_size=config["slide_size"],
        mean_filter_size=config["mean_filter_size"]
    output: grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
    shell:
        """./preprocess_coverage.py -i {input.input2} -g {input.input1} -sw {params.slide_size} -ws {params.window_size} -m mean -bm mean -bs {params.mean_filter_size} -o {output}"""
rule calc_bptr:
    params:
        window_size=config["window_size"],
        slide_size=config["slide_size"],
        mean_filter_size=config["mean_filter_size"]
    input:
        input1= grouphome+"/data/{isolate}.fasta",
        input2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
    output:
        output1= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv",
        output2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv",
    shell:
        """./calculate_bptr.py -i {input.input2} -b mean -s {params.mean_filter_size} -g {input.input1} -f {output.output2} -o {output.output1}"""

I am using the following command to run it:
snakemake --config window_size=1000000 slide_size=10000 mean_filter_size=99 --cores=9

How do I set multiple sets of values for --config? For example, (1000000, 10000, 99), (1000000, 100, 66), and so on.

Answer 1

Score: 1

There are several layers to this problem.

First, I would prefer passing a config file to the snakemake call, e.g. snakemake --configfile config.yaml, with config.yaml containing the values for the parameters. There, you can specify a list of values for each parameter, such as:

window_size:
  - 1000
  - 2000

slide_size:
  - 5
  - 10

The second problem is how to run a rule with a list of parameter values.
I would suggest requesting the outputs for all parameter values in rule all and letting the downstream rules take their parameter values from wildcards.


# This rule requests all combinations of the two parameters, which are passed as lists from the config.
rule all:
    input:
        expand(
            "results/{window_size}_{slide_size}.csv",
            window_size=config["window_size"],
            slide_size=config["slide_size"]
        ),

# This rule recovers the parameter values from the requested output path; they are available through the wildcards object.
rule preprocess_coverage:
    input:
        ...,
    output:
        "results/{window_size}_{slide_size}.csv"
    shell:
        "... -sw {wildcards.slide_size} -ws {wildcards.window_size} -o {output}"

Using the example config.yaml from above, the first rule (rule all) requests the following files: results/1000_5.csv, results/1000_10.csv, results/2000_5.csv and results/2000_10.csv. Snakemake recognizes that the rule preprocess_coverage can produce these files by setting the wildcard values, which are then accessed in the shell directive of the rule as parameters for the script.
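To make the mechanics concrete, here is a minimal plain-Python sketch of the two steps described above. It does not use Snakemake: expand_product and match_output are hypothetical helper names that only imitate what expand and wildcard matching do conceptually, and Snakemake's real implementations differ in detail.

```python
import itertools
import re


def expand_product(pattern, **params):
    """Imitate expand(): fill the pattern with every combination of values."""
    names = list(params)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in itertools.product(*(params[name] for name in names))
    ]


def match_output(pattern, path):
    """Imitate wildcard matching: recover wildcard values from a file path."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


# Same values as the example config.yaml above.
config = {"window_size": [1000, 2000], "slide_size": [5, 10]}
pattern = "results/{window_size}_{slide_size}.csv"

targets = expand_product(pattern, **config)
print(targets)

# The rule sees each requested file and derives its wildcard values from it.
print(match_output(pattern, targets[0]))
```

Note that the recovered wildcard values are strings, which is why they can be dropped straight into the shell command.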

See the Snakemake documentation for more about wildcards and about the expand function.

In the expand function you can provide any combinatorial function to produce the results. For example, to produce only results/1000_5.csv and results/2000_10.csv, use the zip function:

rule all:
    input:
        expand(
            "results/{window_size}_{slide_size}.csv",
            zip,
            window_size=config["window_size"],
            slide_size=config["slide_size"]
        ),
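Applied to the parameter sets from the question, the zip variant means that each config list holds one "column" of the desired sets, so that position i across the lists forms set i. The following plain-Python sketch illustrates the pairing without Snakemake. expand_zip is a hypothetical helper, and raising on unequal list lengths is a choice made for this sketch, not documented Snakemake behaviour.

```python
def expand_zip(pattern, **params):
    """Imitate expand(pattern, zip, ...): pair the i-th values of every list."""
    names = list(params)
    lengths = {len(params[name]) for name in names}
    if len(lengths) > 1:
        # Assumption of this sketch: refuse lists that cannot be paired up.
        raise ValueError("all parameter lists must have the same length")
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in zip(*(params[name] for name in names))
    ]


# The two parameter sets from the question, stored column-wise:
# set 0 = (1000000, 10000, 99), set 1 = (1000000, 100, 66)
config = {
    "window_size": [1000000, 1000000],
    "slide_size": [10000, 100],
    "mean_filter_size": [99, 66],
}

dirs = expand_zip(
    "preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}",
    **config,
)
print(dirs)
```

Only the two requested directories are produced, rather than all eight combinations that the default product would give.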

With Snakemake 5.31 and above there is another option: parameter space exploration, which is well described in the Snakemake manual. Paramspace follows the same principle as the approach above but needs less code. I would recommend trying the approach above first, because understanding it is important for understanding how Paramspace works.
