Snakemake changes wildcard, resulting in InputFunctionException

Question


I keep getting the same errors at the same step in the pipeline. I have two rules, typing and apply_qc, which somehow conflict. typing uses the output of another rule, polish_consensus, and apply_qc uses the outputs of typing (so the order is polish_consensus > typing > apply_qc). The outputs of typing are a FASTA and a CSV file. apply_qc is a quality-control step that censors the data in these files when the quality is low. I keep getting the following errors with these rules:

The code:

```python
rule typing:
    input:
        f"{DATA_FOLDER}/vaccines.fasta",
        rules.polish_consensus.output
    output:
        temp(f"{OUTPUT_FOLDER}/typing/{{samplename}}.csv")
    script:
        "../scripts/typing.py"

rule apply_qc:
    input:
        rules.typing.output,
        rules.polish_consensus.output,
        rules.featurecounts.output.summary
    output:
        typing=f"{OUTPUT_FOLDER}/typing/{{samplename}}_qcpass.csv",
        consensus=f"{OUTPUT_FOLDER}/consensus/{{samplename}}_qcpass.fasta"
    script:
        "../scripts/apply_qc.py"
```

The output of the rule polish_consensus is output/consensus/{samplename}.fasta with samplename=prrsv12.

The error:

```
InputFunctionException in rule typing in file /home/lisah/Pycharm/minor-HTHPC/snakemake/workflow/rules/typing.smk, line 1:
Error:
  KeyError: 'prrsv12_qcpass'
Wildcards:
  samplename=prrsv12_qcpass
Traceback:
  File "/home/lisah/Pycharm/minor-HTHPC/snakemake/workflow/rules/typing.smk", line 12, in <lambda>
```

The error shows that the wildcard used is prrsv12_qcpass, but the wildcard should be prrsv12. Moreover, prrsv12_qcpass is the base name of an output file of the apply_qc rule.

The second error is one I hoped I had already fixed, but it shows more information than the previous one:

```
AmbiguousRuleException:
Rules apply_qc and typing are ambiguous for the file output/typing/prrsv12_qcpass.csv.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
        apply_qc: samplename=prrsv12
        typing: samplename=prrsv12_qcpass
Expected input files:
        apply_qc: output/typing/prrsv12.csv output/consensus/prrsv12.fasta output/counts/prrsv12_summary.csv
        typing: data/prrsv/vaccines.fasta output/consensus/prrsv12_qcpass.fasta
Expected output files:
        apply_qc: output/typing/prrsv12_qcpass.csv output/consensus/prrsv12_qcpass.fasta
        typing: output/typing/prrsv12_qcpass.csv
```

As noted above, the wildcard for typing is wrong, but the wildcard for apply_qc is correct. Likewise, the expected input for typing should be output/consensus/prrsv12.fasta, not output/consensus/prrsv12_qcpass.fasta.

I hoped I had fixed the AmbiguousRuleException by using the `rules.<rule>.output` syntax and adding a ruleorder. As for the first error, I am completely lost and have no idea why it happens. It looks like a problem with the wildcard, but I do not understand how the _qcpass part ends up in the wildcard. The error also seems to occur at random: some runs work fine and others crash with it (yes, with the same data).

EDIT:

I tried running it with --debug-dag, and the only thing that stood out is the following:

```
selected job readcap
    wildcards: samplename=prrsv20_qcpass
file output/fastq/prrsv20_qcpass_readcap.fastq.gz:
    Producer found, hence exceptions are ignored.

candidate job select_centroid
    wildcards: samplename=prrsv20_qcpass
candidate job featurecounts
    wildcards: samplename=prrsv20_qcpass
candidate job map2ref
    wildcards: samplename=prrsv20_qcpass
candidate job apply_qc
    wildcards: samplename=prrsv20
selected job apply_qc
    wildcards: samplename=prrsv20
```

The _qcpass is added to the wildcard for the rest of the pipeline, but seems to work fine for apply_qc? apply_qc is one of the last rules in the pipeline...

Answer 1

Score: 0

It sounds like you misunderstood what ambiguous rules mean for `snakemake`, why you should avoid them and why `ruleorder` will not solve your problem.

First of all, here's an MWE, a minimal working example that reproduces your issue. Note that it is easier for everyone if you provide such an example yourself, along with the call used to run `snakemake`.

In this case, the problem can be reproduced by calling `snakemake -call` (short for `snakemake --cores all`):

```python
rule polish_consensus:
    output:
        "consensus/{samplename}.fasta",
    shell:
        """
        echo polish_consensus > {output[0]}
        """


rule typing:
    input:
        rules.polish_consensus.output,
    output:
        "typing/{samplename}.csv",
    shell:
        """
        cat {input[0]} > {output[0]}
        """


# Both outputs add the fixed suffix _qcpass after the wildcard.
rule apply_qc:
    input:
        rules.typing.output,
        rules.polish_consensus.output,
    output:
        typing="typing/{samplename}_qcpass.csv",
        consensus="consensus/{samplename}_qcpass.fasta",
    shell:
        """
        echo qcpass > {output[0]}
        echo qcpass > {output[1]}
        """


# Default target: request the apply_qc outputs for one sample.
rule all:
    default_target: True
    input:
        expand(rules.apply_qc.output[0], samplename="prrsv12"),
        expand(rules.apply_qc.output[1], samplename="prrsv12"),
```

Your wildcard {samplename} will be matched by snakemake against all the output-files you request as well as files snakemake has to generate to run the workflow.

Requesting `typing/prrsv12_qcpass.csv` therefore matches the output of rule `apply_qc` with `samplename=prrsv12` as well as the output of rule `typing` with `samplename=prrsv12_qcpass`. To prevent this, constrain your wildcard rather than relying on `ruleorder` or on references of the form `rules.<name>.output`.
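The double match can be reproduced outside Snakemake with plain regular expressions. The sketch below is a simplified model, not Snakemake's actual implementation: the helper `pattern_to_regex` is hypothetical, and it relies only on the documented default that an unconstrained wildcard matches the regex `.+`.

```python
import re


def pattern_to_regex(pattern, constraint=".+"):
    """Turn an output pattern like 'typing/{samplename}.csv' into a regex.

    Hypothetical helper for illustration only. Each {name} becomes a named
    group matching `constraint`; '.+' mirrors Snakemake's documented default.
    """
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)              # literal text
        else:
            regex += f"(?P<{part}>{constraint})"  # wildcard name
    return re.compile(regex)


# Output patterns of the two rules from the MWE.
outputs = {
    "typing": "typing/{samplename}.csv",
    "apply_qc": "typing/{samplename}_qcpass.csv",
}
typing_input = "consensus/{samplename}.fasta"
target = "typing/prrsv12_qcpass.csv"

# Collect every rule whose output pattern matches the requested file.
matches = {}
for rule, pattern in outputs.items():
    m = pattern_to_regex(pattern).fullmatch(target)
    if m is not None:
        matches[rule] = m.groupdict()

for rule, wildcards in matches.items():
    print(rule, wildcards)

# The input that typing would ask for with the unintended wildcard value.
wrong_input = typing_input.format(**matches["typing"])
print(wrong_input)
```

Both rules match the same target, and the unintended match for `typing` leads to a request for `consensus/prrsv12_qcpass.fasta`, which corresponds to the "Expected input files" line in the AmbiguousRuleException from the question.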

With `wildcard_constraints` you tell `snakemake` which strings a wildcard may match. In your case, `samplename` is presumably never going to contain an underscore, so you can use:

```python
wildcard_constraints:
    samplename="[a-zA-Z0-9]+",
```

This tells `snakemake` to match lowercase and uppercase letters and the digits 0-9, but no whitespace or other symbols such as underscores. As a result, `snakemake` never considers `prrsv12_qcpass` as a value for `samplename`; it only considers `prrsv12`, with `_qcpass` as an additional, fixed part of the filename.
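The effect of the constraint can be checked with Python's `re` module. This is only a sketch of the matching step, assuming that the constraint regex takes the place of the wildcard in each output pattern; the two pattern strings are written out by hand for illustration.

```python
import re

CONSTRAINT = r"[a-zA-Z0-9]+"  # same regex as in wildcard_constraints
target = "typing/prrsv12_qcpass.csv"

# Output patterns of the two rules with {samplename} replaced by the constraint.
typing_re = re.compile(rf"typing/(?P<samplename>{CONSTRAINT})\.csv")
apply_qc_re = re.compile(rf"typing/(?P<samplename>{CONSTRAINT})_qcpass\.csv")

typing_match = typing_re.fullmatch(target)
apply_qc_match = apply_qc_re.fullmatch(target)

# typing no longer matches, because '_' is not allowed in samplename.
print(typing_match)
# apply_qc still matches, with the intended wildcard value.
print(apply_qc_match.group("samplename"))

# The regular typing output is unaffected by the constraint.
plain_match = typing_re.fullmatch("typing/prrsv12.csv")
print(plain_match.group("samplename"))
```

With the constraint in place, only `apply_qc` can produce `typing/prrsv12_qcpass.csv`, so there is nothing left to be ambiguous about.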

More on `wildcard_constraints` can be found in the Snakemake documentation.

Putting everything together into a single Snakefile:

```python
# Restrict samplename to letters and digits so that _qcpass can never be part of it.
wildcard_constraints:
    samplename="[a-zA-Z0-9]+",

rule polish_consensus:
    output:
        "consensus/{samplename}.fasta",
    shell:
        """
        echo polish_consensus > {output[0]}
        """


rule typing:
    input:
        rules.polish_consensus.output,
    output:
        "typing/{samplename}.csv",
    shell:
        """
        cat {input[0]} > {output[0]}
        """


rule apply_qc:
    input:
        rules.typing.output,
        rules.polish_consensus.output,
    output:
        typing="typing/{samplename}_qcpass.csv",
        consensus="consensus/{samplename}_qcpass.fasta",
    shell:
        """
        echo qcpass > {output[0]}
        echo qcpass > {output[1]}
        """


rule all:
    default_target: True
    input:
        expand(rules.apply_qc.output[0], samplename="prrsv12"),
        expand(rules.apply_qc.output[1], samplename="prrsv12"),
```
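Because the constraint rejects anything other than letters and digits, a sample whose name contains an underscore or a dot would no longer match any rule. A small check like the following could be run on the sample list beforehand. It is a sketch: the `samples` list and the `invalid_samples` helper are made up for illustration and are not part of the original workflow.

```python
import re

# Same regex as the samplename entry in wildcard_constraints.
SAMPLENAME_RE = re.compile(r"[a-zA-Z0-9]+")


def invalid_samples(names):
    """Return the names that the samplename constraint would not match."""
    return [name for name in names if SAMPLENAME_RE.fullmatch(name) is None]


samples = ["prrsv12", "prrsv20", "prrsv_21"]  # hypothetical sample names
bad = invalid_samples(samples)
print(bad)
```

If the list is not empty, either rename those samples or widen the constraint in a way that still excludes the `_qcpass` suffix.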

huangapple
  • Published on May 17, 2023, 21:49:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76272859.html