英文:
Snakemake workflow in which wildcards produce different output files
问题
I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file]
files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file]
.
breeds = {"CHA": "CHAROLAIS", "BEL":"BELGIAN BLUE"}
rule extract_phenotype_data:
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
run:
cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
The file gen_file
is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.
I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.
I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.
The snakemake version is 7.25.0
英文:
I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file]
files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file]
.
genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS", "BEL":"BELGIAN BLUE"}
rule extract_phenotype_data:
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
run:
cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
The file gen_file
is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.
I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.
I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.
The snakemake version is 7.25.0
答案1
得分: 0
可以通过两个独立的规则、一个输入函数和规则顺序来实现这一点。思路是,对于某些通配符,输入函数将引发异常,从而导致其他规则运行。
genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS", "BEL":"BELGIAN BLUE"}
ruleorder:
do_pedigree_with_genome > do_pedigree # 总是先尝试基因组版本的规则
def do_pedigree_with_genome_input(wildcards):
if wildcards.breed not in genomic_breeds:
raise ValueError() # 这个品种不应该使用这个规则
return ...
rule do_pedigree_with_genome:
input:
do_pedigree_with_genome_input
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
rule do_pedigree:
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
...
所以,CHA
首先尝试使用基因组版本的规则并运行,因为它是genomic_breeds中的一个。BEL
也会尝试使用基因组版本,但是值错误会导致它使用常规版本。
作为副作用,如果尝试使用不在genomic_breeds中的品种的gen_file,snakemake将引发异常,因为没有可用的规则来生成它。
英文:
This can be achieved with two separate rules, an input function, and rule order. The idea is that for some wildcards the input function will raise an exception causing the other rule to run.
genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS", "BEL":"BELGIAN BLUE"}
ruleorder:
do_pedigree_with_genome > do_pedigree # always try the genome version first
def do_pedigree_with_genome_input(wildcards):
if wildcards.breed not in genomic_breeds:
raise ValueError() # this breed shouldn't use this rule
return ...
rule do_pedigree_with_genome:
input:
do_pedigree_with_genome_input
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
rule do_pedigree:
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
...
So CHA
will first try to use the genome version of the rule and run because it is one of the genomic_breeds. BEL
will also try the genome version, but the value error will cause it to use the regular version instead.
As a side effect, if you try to consume the gen_file of a breed not in the genomic_breeds, snakemake will raise an exception because no rule is available to produce it.
答案2
得分: 0
我使用两个规则与通配符约束指令相结合找到了这个问题的解决方案。
第一个规则生成两个输出,仅适用于通配符不在 genomic_breeds
字典中的情况。
rule extract_phenotype_data:
wildcard_constraints:
breed='|'.join([breed for breed in breeds.keys() if not breed
in genomic_breeds])
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
run:
cmd = f"python /../extract_phenotype_data_for_populations.py
--config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
第二个规则也使用通配符约束指令,仅适用于通配符在 genomic_breeds
字典中的情况。
rule extract_phenotype_data_genomic:
wildcard_constraints:
breed='|'.join(list(genomic_breeds.keys()))
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
run:
cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
这两个规则的输出不同,只有一个规则生成 ../genotypes_{{breed}}.txt
文件。
Rule All 可以定义为
rule all:
input:
expand(f"../phen_{{breed}}.txt", breed=breeds.keys()),
expand(f"../listcodeall{{breed}}.txt", breed=breeds.keys()),
expand(f"../genotypes_{{breed}}.txt", breed =genomic_breeds.keys()),
也许这对某一天的某人有所帮助。
英文:
I found a solution to this question by using two rules in combination with the wildcard constraints directive.
The first rule produces two outputs and only works on wildcards not in the genomic_breeds
dictionary.
rule extract_phenotype_data:
wildcard_constraints:
breed='|'.join([breed for breed in breeds.keys() if not breed
in genomic_breeds])
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
run:
cmd = f"python /../extract_phenotype_data_for_populations.py
--config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
The second rule uses the wildcard constraints directive as well and only works on wildcards in the genomic_breeds
dictionary.
rule extract_phenotype_data_genomic:
wildcard_constraints:
breed='|'.join(list(genomic_breeds.keys()))
input:
params:
config = "../config_file.yml",
breed =f"{{breed}}"
output:
dlistAnim=f"../listcodeall{{breed}}.txt",
phen_file=f"../phen_{{breed}}.txt",
gen_file=f"../genotypes_{{breed}}.txt"
run:
cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
shell(cmd)
The output of both rules are different in that only one of the rules produces the ../genotypes_{{breed}}.txt
file.
Rule All can be defined as
rule all:
input:
expand(f"../phen_{{breed}}.txt", breed=breeds.keys()),
expand(f"../listcodeall{{breed}}.txt", breed=breeds.keys()),
expand(f"../genotypes_{{breed}}.txt", breed =genomic_breeds.keys()),
Perhaps this helps someone someday.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论