2023年5月11日 17:58:06go评论73阅读模式

英文:

Snakemake workflow in which wildcards produce different output files

问题

I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file] files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file].

breeds = {&quot;CHA&quot;: &quot;CHAROLAIS&quot;,  &quot;BEL&quot;:&quot;BELGIAN BLUE&quot;}

rule extract_phenotype_data:
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
        gen_file=f&quot;../genotypes_{{breed}}.txt&quot;
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

The file gen_file is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.

I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.

I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.

The snakemake version is 7.25.0

英文:

genomic_breeds = {&quot;CHA&quot;: &quot;CHAROLAIS&quot;}
breeds = {&quot;CHA&quot;: &quot;CHAROLAIS&quot;,  &quot;BEL&quot;:&quot;BELGIAN BLUE&quot;}


rule extract_phenotype_data:
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
        gen_file=f&quot;../genotypes_{{breed}}.txt&quot;
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

The file gen_file is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.

I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.

The snakemake version is 7.25.0

答案1

得分: 0

可以通过两个独立的规则、一个输入函数和规则顺序来实现这一点。思路是，对于某些通配符，输入函数将引发异常，从而导致其他规则运行。

genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}

ruleorder:
    do_pedigree_with_genome > do_pedigree  # 总是先尝试基因组版本的规则

def do_pedigree_with_genome_input(wildcards):
    if wildcards.breed not in genomic_breeds:
        raise ValueError()  # 这个品种不应该使用这个规则
    return ...

rule do_pedigree_with_genome:
    input:
        do_pedigree_with_genome_input
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"

rule do_pedigree:
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
    ...

所以，CHA 首先尝试使用基因组版本的规则并运行，因为它是genomic_breeds中的一个。BEL 也会尝试使用基因组版本，但是值错误会导致它使用常规版本。

作为副作用，如果尝试使用不在genomic_breeds中的品种的gen_file，snakemake将引发异常，因为没有可用的规则来生成它。

英文:

This can be achieved with two separate rules, an input function, and rule order. The idea is that for some wildcards the input function will raise an exception causing the other rule to run.

genomic_breeds = {&quot;CHA&quot;: &quot;CHAROLAIS&quot;}
breeds = {&quot;CHA&quot;: &quot;CHAROLAIS&quot;,  &quot;BEL&quot;:&quot;BELGIAN BLUE&quot;}

ruleorder:
    do_pedigree_with_genome &gt; do_pedigree  # always try the genome version first

def do_pedigree_with_genome_input(wildcards):
    if wildcards.breed not in genomic_breeds:
        raise ValueError()  # this breed shouldn&#39;t use this rule
    return ...

rule do_pedigree_with_genome:
    input:
        do_pedigree_with_genome_input
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
        gen_file=f&quot;../genotypes_{{breed}}.txt&quot;

rule do_pedigree:
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
    ...

So CHA will first try to use the genome version of the rule and run because it is one of the genomic_breeds. BEL will also try the genome version, but the value error will cause it to use the regular version instead.

As a side effect, if you try to consume the gen_file of a breed not in the genomic_breeds, snakemake will raise an exception because no rule is available to produce it.

答案2

得分: 0

我使用两个规则与通配符约束指令相结合找到了这个问题的解决方案。

第一个规则生成两个输出，仅适用于通配符不在 genomic_breeds 字典中的情况。

rule extract_phenotype_data:
        wildcard_constraints:
        breed=&#39;|&#39;.join([breed for breed in breeds.keys() if not breed 
        in genomic_breeds])
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py 
        --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

第二个规则也使用通配符约束指令，仅适用于通配符在 genomic_breeds 字典中的情况。

rule extract_phenotype_data_genomic:
    wildcard_constraints:
        breed=&#39;|&#39;.join(list(genomic_breeds.keys()))
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
        gen_file=f&quot;../genotypes_{{breed}}.txt&quot;
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

这两个规则的输出不同，只有一个规则生成 ../genotypes_{{breed}}.txt 文件。

Rule All 可以定义为

rule all:
    input:
        expand(f&quot;../phen_{{breed}}.txt&quot;, breed=breeds.keys()),
        expand(f&quot;../listcodeall{{breed}}.txt&quot;, breed=breeds.keys()),
        expand(f&quot;../genotypes_{{breed}}.txt&quot;, breed =genomic_breeds.keys()),

也许这对某一天的某人有所帮助。

英文:

I found a solution to this question by using two rules in combination with the wildcard constraints directive.

The first rule produces two outputs and only works on wildcards not in the genomic_breeds dictionary.

rule extract_phenotype_data:
        wildcard_constraints:
        breed=&#39;|&#39;.join([breed for breed in breeds.keys() if not breed 
        in genomic_breeds])
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py 
        --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

The second rule uses the wildcard constraints directive as well and only works on wildcards in the genomic_breeds dictionary.

rule extract_phenotype_data_genomic:
    wildcard_constraints:
        breed=&#39;|&#39;.join(list(genomic_breeds.keys()))
    input:
        
    params:
        config = &quot;../config_file.yml&quot;,
        breed =f&quot;{{breed}}&quot;
    output:
        dlistAnim=f&quot;../listcodeall{{breed}}.txt&quot;,
        phen_file=f&quot;../phen_{{breed}}.txt&quot;,
        gen_file=f&quot;../genotypes_{{breed}}.txt&quot;
    run:
        cmd = f&quot;python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}&quot;
        shell(cmd)

The output of both rules are different in that only one of the rules produces the ../genotypes_{{breed}}.txt file.

Rule All can be defined as

rule all:
    input:
        expand(f&quot;../phen_{{breed}}.txt&quot;, breed=breeds.keys()),
        expand(f&quot;../listcodeall{{breed}}.txt&quot;, breed=breeds.keys()),
        expand(f&quot;../genotypes_{{breed}}.txt&quot;, breed =genomic_breeds.keys()),

Perhaps this helps someone someday.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Snakemake工作流中的通配符生成不同的输出文件。

问题

答案1

答案2

为什么我无法导入这个模块？

根据对sleep（）函数的重复调用，给定的代码执行所需的最短时间是多少？

用Python读取具有奇怪分隔符的文件

生成一个 pandas 数据框的直方图，其中列是各个箱子。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论