如何解决Snakemake中的通配符错误?

huangapple go评论59阅读模式
英文:

How to resolve wildcard error in Snakemake?

问题

抱歉,您提供的内容包含代码和错误信息,我可以帮助您翻译部分文本,但我无法直接解决代码错误。以下是翻译好的内容:

错误信息:

错误:
构建作业的DAG...
在/path/to/pipeline/workflow/Snakefile.py的502行中的WildcardError:
无法从输出文件确定输入文件的通配符:
'anc_r'

这似乎是关于Snakemake规则中的通配符问题。通配符 'anc_r' 无法从输出文件中确定输入文件。这可能需要检查Snakefile规则中的通配符和输入输出文件,确保它们正确匹配。如果需要更多帮助,您可能需要查看Snakefile中的相关规则以找出问题所在。

此外,您提到键盘损坏,如果需要进一步帮助,可以提供更多有关代码和问题的详细信息,我会尽力协助您。

英文:

Error:

Building DAG of jobs...
WildcardErrorin line 502 of /path/to/pipeline/workflow/Snakefile.py:
Wildcards in input files cannot be determined from output files:
'anc_r'

Code:


import os
import json
from datetime import datetime
from glob import iglob
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# discover input files using path from run config
SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))
# read output dir path from run config
OUTPUT_DIR = f"{config['output_dir']}"
# Project name and date for bam header
SEQID='pipeline_align'
config['anc_r'] = list(set(glob_wildcards(f"{config['anc_dir']}/{{anc_r}}_R1_001.fastq.gz").anc_r))
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule 
rule all:
input:
f'{OUTPUT_DIR}/DONE.txt'
# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
# export the current run configuration in JSON format
#
rule export_run_config:
output:
path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
run:
with open(output.path, 'w') as outfile:
json.dump(dict(config), outfile, indent=4)
#
# make a list of discovered samples
#
rule list_samples:
output:
f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
shell:
"echo -e '{}' > {{output}}".format('\n'.join(SAMPLES))
#
# copy the supplied reference genome fasta to the pipeline output directory for reference
#
rule copy_fasta:
input:
config['ref_fasta']
output:
f"{OUTPUT_DIR}/01_ref_files/{os.path.basename(config['ref_fasta'])}"
shell:
"cp {input} {output}"
rule index_fasta:
input:
rules.copy_fasta.output
output:
f"{rules.copy_fasta.output}.fai"
conda:
'envs/main.yml'
shell:
"samtools faidx {input}"
rule create_ref_dict:
input:
rules.copy_fasta.output
output:
f"{rules.copy_fasta.output}".rstrip('fasta') + 'dict'
conda:
'envs/main.yml'
shell:
"picard CreateSequenceDictionary -R {input}"
#
# create a BWA index from the copied fasta reference genome
#
rule create_bwa_index:
input:
rules.copy_fasta.output
output:
f"{rules.copy_fasta.output}.amb",
f"{rules.copy_fasta.output}.ann",
f"{rules.copy_fasta.output}.bwt",
f"{rules.copy_fasta.output}.pac",
f"{rules.copy_fasta.output}.sa",
conda:
'envs/main.yml'
shell:
"bwa index {input}"        
#
# Get the ancestor BAM
#
rule align_reads_anc:
input:
rules.create_bwa_index.output,
R1=lambda wildcards: f"{config['anc_r']}/{anc_r}_R1_001.fastq.gz",
R2=lambda wildcards: f"{config['anc_r']}/{anc_r}_R2_001.fastq.gz",
output:
bam=f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort.bam",
conda:
'envs/main.yml'
shell:
r"""bwa mem -R '@RG\tID:""" + SEQID + r"""\tSM:""" + '{wildcards.anc_r}' + r"""\tLB:1'""" + ' {rules.copy_fasta.output} {input.R1} {input.R2} | samtools sort -o {output.bam} - && samtools index {output.bam}'
rule samtools_index_one_anc:
input:
rules.align_reads_anc.output
output:
f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort.bam.bai"
conda:
'envs/main.yml'
shell:
"samtools index {input}"
rule samtools_flagstat_anc:
input:
bam=rules.align_reads_anc.output,
idx=rules.samtools_index_one_anc.output
output:
f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort_flagstat.txt"
conda:
'envs/main.yml'
shell:
"samtools flagstat {input.bam} > {output}"
rule picard_mark_dupes_anc:
input:
bam=rules.align_reads_anc.output,
idx=rules.samtools_index_one_anc.output,
dct=rules.create_ref_dict.output
output:
bam=f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.MD.bam",
metrics=f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.sort_dup_metrix"
conda:
'envs/main.yml'
shell:
"picard MarkDuplicates --INPUT {input.bam} --OUTPUT {output.bam} --METRICS_FILE {output.metrics} --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT"
rule picard_read_groups_anc:
input:
rules.picard_mark_dupes_anc.output.bam
output:
f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.sort.bam",
conda:
'envs/main.yml'
shell:
f"picard AddOrReplaceReadGroups --INPUT {{input}} --OUTPUT /dev/stdout --RGID {SEQID} --RGLB 1 --RGPU 1 --RGPL illumina --RGSM {{wildcards.anc_r}} --VALIDATION_STRINGENCY LENIENT | samtools sort -o {{output}} -"
rule samtools_index_two_anc:
input:
rules.picard_read_groups_anc.output
output:
f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.sort.bam.bai",
conda:
'envs/main.yml'
shell:
"samtools index {input}"
rule gatk_register:
input:
f'workflow/envs/src/GenomeAnalysisTK-3.7-0-gcfedb67.tar.bz2'
output:
f'{OUTPUT_DIR}/05_gatk/gatk_3.7_registered.txt'
conda:
'envs/main.yml'
shell:
"gatk-register {input} > {output}"
rule gatk_realign_targets_anc:
input:
fa=rules.copy_fasta.output,
bam=rules.picard_read_groups_anc.output,
idx=rules.samtools_index_two_anc.output,
gatk=rules.gatk_register.output,
faidx=rules.index_fasta.output
output:
f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.bam.intervals'
conda:
'envs/main.yml'
shell:
"GenomeAnalysisTK -T RealignerTargetCreator -R {input.fa} -I {input.bam} -o {output}"
rule gatk_realign_indels_anc:
input:
fa=rules.copy_fasta.output,
intervals=rules.gatk_realign_targets_anc.output,
bam=rules.picard_read_groups_anc.output,
idx=rules.samtools_index_two_anc.output        
output:
bam=f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.bam',
bai=f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.bai'
conda:
'envs/main.yml'
shell:
"GenomeAnalysisTK -T IndelRealigner -R {input.fa} -I {input.bam} -targetIntervals {input.intervals} -o {output.bam}"
rule samtools_sort_three_anc:
input:
rules.gatk_realign_indels_anc.output.bam
output:
f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.sort.bam',
conda:
'envs/main.yml'
shell:
"samtools sort {input} -o {output}"
rule samtools_index_three_anc:
input:
rules.samtools_sort_three_anc.output
output:
f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.sort.bam.bai',
conda:
'envs/main.yml'
shell:
"samtools index {input}"
rule index_ancestor_bam:
input:
rules.samtools_sort_three_anc.output
output:
f"{rules.samtools_sort_three_anc.output}.bai"
conda:
'envs/main.yml'
shell:
"samtools index {input}"
rule bcftools_pileup_anc:
input:
bam=rules.samtools_sort_three_anc.output,
idx=rules.samtools_index_three_anc.output
output:
f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_samtools_AB.vcf',
conda:
'envs/main.yml'
shell:
"bcftools mpileup --ignore-RG -Ou -ABf {rules.copy_fasta.output} {input.bam} | bcftools call -vmO v -o {output}"
rule freebayes_anc:
input:
bam=rules.samtools_sort_three_anc.output,
idx=rules.samtools_index_three_anc.output
output:
f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_freebayes_BCBio.vcf',
conda:
'envs/main.yml'
shell:
"freebayes -f {rules.copy_fasta.output} --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off --min-alternate-fraction 0.1 {input.bam} > {output}"
rule lofreq_anc:
input:
bam=rules.samtools_sort_three_anc.output,
idx=rules.samtools_index_three_anc.output,
ancidx=rules.index_ancestor_bam.output,
output:
normal=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_normal_relaxed.vcf.gz',
tumor=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_tumor_relaxed.vcf.gz',
somatic=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_somatic_final.snvs.vcf.gz',
conda:
'envs/main.yml'
shell:
f"lofreq somatic -n {{input.bam}} -t {{input.bam}} -f {{rules.copy_fasta.output}} -o {OUTPUT_DIR}/06_variant_calling/{{wildcards.anc_r}}/{{wildcards.anc_r}}_lofreq_"
rule unzip_lofreq_anc:
input:
normal=rules.lofreq_anc.output.normal,
tumor=rules.lofreq_anc.output.tumor,
somatic=rules.lofreq_anc.output.somatic,
output:
normal=rules.lofreq_anc.output.normal.replace('.vcf.gz', '.vcf'),
tumor=rules.lofreq_anc.output.tumor.replace('.vcf.gz', '.vcf'),
somatic=rules.lofreq_anc.output.somatic.replace('.vcf.gz', '.vcf'),
conda:
'envs/main.yml'
shell:
"bgzip -d {input.normal} && bgzip -d {input.tumor} && bgzip -d {input.somatic}"
# ~~~~~~~~~~~~~~~~~~~~~~~~ Perform Initial Alignment ~~~~~~~~~~~~~~~~~~~~~~~~ #
rule align_reads:
input:
rules.create_bwa_index.output,
r1=f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz",
r2=f"{config['fastq_dir']}/{{sample}}_R2_001.fastq.gz",
output:
f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort.bam"
conda:
'envs/main.yml'
shell:
r"""bwa mem -R '@RG\tID:""" + SEQID + r"""\tSM:""" + '{wildcards.sample}' + r"""\tLB:1'""" + ' {rules.copy_fasta.output} {input.r1} {input.r2} | samtools sort -o {output} -'
rule samtools_index_one:
input:
rules.align_reads.output
output:
f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort.bam.bai"
conda:
'envs/main.yml'
shell:
"samtools index {input}"
rule samtools_flagstat:
input:
bam=rules.align_reads.output,
idx=rules.samtools_index_one.output
output:
f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort_flagstat.txt"
conda:
'envs/main.yml'
shell:
"samtools flagstat {input.bam} > {output}"
rule picard_mark_dupes:
input:
bam=rules.align_reads.output,
idx=rules.samtools_index_one.output,
dct=rules.create_ref_dict.output
output:
bam=f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.MD.bam",
metrics=f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.sort_dup_metrix"
conda:
'envs/main.yml'
shell:
"picard MarkDuplicates --INPUT {input.bam} --OUTPUT {output.bam} --METRICS_FILE {output.metrics} --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT"
rule picard_read_groups:
input:
rules.picard_mark_dupes.output.bam
output:
f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.RG.MD.sort.bam",
conda:
'envs/main.yml'
shell:
f"picard AddOrReplaceReadGroups --INPUT {{input}} --OUTPUT /dev/stdout --RGID {SEQID} --RGLB 1 --RGPU 1 --RGPL illumina --RGSM {{wildcards.sample}} --VALIDATION_STRINGENCY LENIENT | samtools sort -o {{output}} -"
rule samtools_index_two:
input:
rules.picard_read_groups.output
output:
f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.RG.MD.sort.bam.bai",
conda:
'envs/main.yml'
shell:
"samtools index {input}"
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATK Re-alignment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
#  configure gatk-3.7 inside conda environment
#
rule gatk_realign_targets:
input:
fa=rules.copy_fasta.output,
bam=rules.picard_read_groups.output,
idx=rules.samtools_index_two.output,
gatk=rules.gatk_register.output,
faidx=rules.index_fasta.output
output:
f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.bam.intervals'
conda:
'envs/main.yml'
shell:
"GenomeAnalysisTK -T RealignerTargetCreator -R {input.fa} -I {input.bam} -o {output}"
rule gatk_realign_indels:
input:
fa=rules.copy_fasta.output,
intervals=rules.gatk_realign_targets.output,
bam=rules.picard_read_groups.output,
idx=rules.samtools_index_two.output        
output:
bam=f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.bam',
bai=f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.bai'
conda:
'envs/main.yml'
shell:
"GenomeAnalysisTK -T IndelRealigner -R {input.fa} -I {input.bam} -targetIntervals {input.intervals} -o {output.bam}"
rule samtools_sort_three:
input:
rules.gatk_realign_indels.output.bam
output:
f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.sort.bam',
conda:
'envs/main.yml'
shell:
"samtools sort {input} -o {output}"
rule samtools_index_three:
input:
rules.samtools_sort_three.output
output:
f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.sort.bam.bai',
conda:
'envs/main.yml'
shell:
"samtools index {input}"
# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Variant Calling ~~~~~~~~~~~~~~~~~~~~~~~~~~ #
rule bcftools_pileup:
input:
bam=rules.samtools_sort_three.output,
idx=rules.samtools_index_three.output
output:
f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_samtools_AB.vcf',
conda:
'envs/main.yml'
shell:
"bcftools mpileup --ignore-RG -Ou -ABf {rules.copy_fasta.output} {input.bam} | bcftools call -vmO v -o {output}"
rule freebayes:
input:
bam=rules.samtools_sort_three.output,
idx=rules.samtools_index_three.output
output:
f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_freebayes_BCBio.vcf',
conda:
'envs/main.yml'
shell:
"freebayes -f {rules.copy_fasta.output} --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off --min-alternate-fraction 0.1 {input.bam} > {output}"
rule lofreq:
input:
bam=rules.samtools_sort_three.output,
idx=rules.samtools_index_three.output,
ancidx=rules.index_ancestor_bam.output,
output:
normal=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_normal_relaxed.vcf.gz',
tumor=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_tumor_relaxed.vcf.gz',
somatic=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_somatic_final.vcf.gz',
conda:
'envs/main.yml'
shell:
f"lofreq somatic -n {{rules.copy_ancestor_bam.output}} -t {{input.bam}} -f {{rules.copy_fasta.output}} -o {{output.somatic}} && "
f"bgzip {{output.normal}} && bgzip {{output.tumor}} && bgzip {{output.somatic}}"
rule unzip_lofreq:
input:
normal=rules.lofreq.output.normal,
tumor=rules.lofreq.output.tumor,
somatic=rules.lofreq.output.somatic,
output:
normal=rules.lofreq.output.normal.replace('.vcf.gz', '.vcf'),
tumor=rules.lofreq.output.tumor.replace('.vcf.gz', '.vcf'),
somatic=rules.lofreq.output.somatic.replace('.vcf.gz', '.vcf'),
conda:
'envs/main.yml'
shell:
"bgzip -d {input.normal} && bgzip -d {input.tumor} && bgzip -d {input.somatic}"
# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Filtering Steps ~~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
# filter samtools results by ancestor, took normal from the output
#
rule anc_filter_samtools:
input:
sample=rules.bcftools_pileup.output,
anc=rules.unzip_lofreq_anc.output.normal
output:
f'{OUTPUT_DIR}/07_filtered/{{sample}}/{{sample}}_samtools_AB_AncFiltered.vcf',
conda:
'envs/main.yml'
shell:
f"bedtools intersect -v -header -a {{input.sample}} -b {{input.anc}} > {{output}}"

Tried fixing wildcards in previous rules.Same error. Tried altering config dict. Same error. Tried altering previous rules. Same error. Tried altering config file. Same error. Tried looking at documentation. Don't understand I'm doing wrong. Thought it was lofreq rule that was the problem. Same error.

P.S. Broken keyboard.

答案1

得分: 1

抱歉为您带来麻烦,是的,这是一个很长的示例,可能是为什么没有人提供帮助的原因。

将来,您可以使用 --debug-dag 选项来查看 Snakemake 构建 DAG 时使用的路径和通配符。我经常使用该选项来查找错误和 bug。

我还想指出,您可以在所有输出中使用 OUTPUT_DIR,也可以设置一个 workdir 并使其他文件相对于工作目录。

至于您特定的错误,Snakemake 做了一个不错的工作来告诉您问题所在:

/path/to/pipeline/workflow/Snakefile.py的第502行出现WildcardError

第502行是 anc_filter_samtools 规则。在输出中,您使用了一个带有 sample 作为通配符的 vcf 文件,Snakemake 抱怨说:

无法从输出文件中确定输入文件的通配符
'anc_r'

这是在 anc 输入文件中使用的一个通配符。因此,您在输入文件中使用了一个通配符(anc),但在输出文件中没有匹配的通配符。

您可以要么在输出文件名中包含 anc,要么创建一个输入函数将 sample 转换为适当的 anc

在处理规则时,请记住,Snakemake 首先解析输出文件中的所有通配符,然后填充其他指令,如输入文件。

英文:

Sorry for your troubles, and yes this is a long example and probably why no one has helped.

For the future, you can use the --debug-dag option to see the path and wildcards snakemake uses when building the dag. I frequently find bugs/errors with that option.

I also want to point out that instead of using OUTPUT_DIR in all your outputs, you can set a workdir and make your other files relative to the working directory.

For your specific error, snakemake does an ok job telling you what is wrong:

WildcardErrorin line 502 of /path/to/pipeline/workflow/Snakefile.py:

Line 502 is the anc_filter_samtools rule. As outputs you have a vcf with sample as the wildcard and snakemake complains

Wildcards in input files cannot be determined from output files:
'anc_r'

Which is a wildcard used in the anc input file. So you have a wildcard in an input file (anc) without a matching wildcard in the output file.

You either need to include anc in the output file name or create an input function to convert the sample to the appropriate anc.

When you work through rules, remember that snakemake starts by resolving all wildcards in the output file then fills in the other directives, like inputs.

huangapple
  • 本文由 发表于 2023年3月3日 18:14:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/75625747.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定