Snakemake工作流中的通配符生成不同的输出文件。

huangapple go评论73阅读模式
英文:

Snakemake workflow in which wildcards produce different output files

问题

I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file] files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file].

breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}

rule extract_phenotype_data:
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

The file gen_file is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.

I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.

I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.

The snakemake version is 7.25.0

英文:

I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file] files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file].

genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}


rule extract_phenotype_data:
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

The file gen_file is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.

I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.

I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.

The snakemake version is 7.25.0

答案1

得分: 0

可以通过两个独立的规则、一个输入函数和规则顺序来实现这一点。思路是,对于某些通配符,输入函数将引发异常,从而导致其他规则运行。

genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}

ruleorder:
    do_pedigree_with_genome > do_pedigree  # 总是先尝试基因组版本的规则

def do_pedigree_with_genome_input(wildcards):
    if wildcards.breed not in genomic_breeds:
        raise ValueError()  # 这个品种不应该使用这个规则
    return ...

rule do_pedigree_with_genome:
    input:
        do_pedigree_with_genome_input
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"

rule do_pedigree:
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
    ...

所以,CHA 首先尝试使用基因组版本的规则并运行,因为它是genomic_breeds中的一个。BEL 也会尝试使用基因组版本,但是值错误会导致它使用常规版本。

作为副作用,如果尝试使用不在genomic_breeds中的品种的gen_file,snakemake将引发异常,因为没有可用的规则来生成它。

英文:

This can be achieved with two separate rules, an input function, and rule order. The idea is that for some wildcards the input function will raise an exception causing the other rule to run.

genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}

ruleorder:
    do_pedigree_with_genome > do_pedigree  # always try the genome version first

def do_pedigree_with_genome_input(wildcards):
    if wildcards.breed not in genomic_breeds:
        raise ValueError()  # this breed shouldn't use this rule
    return ...

rule do_pedigree_with_genome:
    input:
        do_pedigree_with_genome_input
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"

rule do_pedigree:
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
    ...

So CHA will first try to use the genome version of the rule and run because it is one of the genomic_breeds. BEL will also try the genome version, but the value error will cause it to use the regular version instead.

As a side effect, if you try to consume the gen_file of a breed not in the genomic_breeds, snakemake will raise an exception because no rule is available to produce it.

答案2

得分: 0

我使用两个规则与通配符约束指令相结合找到了这个问题的解决方案。

第一个规则生成两个输出,仅适用于通配符不在 genomic_breeds 字典中的情况。

rule extract_phenotype_data:
        wildcard_constraints:
        breed='|'.join([breed for breed in breeds.keys() if not breed 
        in genomic_breeds])
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py 
        --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

第二个规则也使用通配符约束指令,仅适用于通配符 genomic_breeds 字典中的情况。

rule extract_phenotype_data_genomic:
    wildcard_constraints:
        breed='|'.join(list(genomic_breeds.keys()))
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

这两个规则的输出不同,只有一个规则生成 ../genotypes_{{breed}}.txt 文件。

Rule All 可以定义为

rule all:
    input:
        expand(f"../phen_{{breed}}.txt", breed=breeds.keys()),
        expand(f"../listcodeall{{breed}}.txt", breed=breeds.keys()),
        expand(f"../genotypes_{{breed}}.txt", breed =genomic_breeds.keys()),

也许这对某一天的某人有所帮助。

英文:

I found a solution to this question by using two rules in combination with the wildcard constraints directive.

The first rule produces two outputs and only works on wildcards not in the genomic_breeds dictionary.

rule extract_phenotype_data:
        wildcard_constraints:
        breed='|'.join([breed for breed in breeds.keys() if not breed 
        in genomic_breeds])
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py 
        --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

The second rule uses the wildcard constraints directive as well and only works on wildcards in the genomic_breeds dictionary.

rule extract_phenotype_data_genomic:
    wildcard_constraints:
        breed='|'.join(list(genomic_breeds.keys()))
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

The output of both rules are different in that only one of the rules produces the ../genotypes_{{breed}}.txt file.

Rule All can be defined as

rule all:
    input:
        expand(f"../phen_{{breed}}.txt", breed=breeds.keys()),
        expand(f"../listcodeall{{breed}}.txt", breed=breeds.keys()),
        expand(f"../genotypes_{{breed}}.txt", breed =genomic_breeds.keys()),

Perhaps this helps someone someday.

huangapple
  • 本文由 发表于 2023年5月11日 17:58:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/76226379.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定