Snakemake expand on a dictionary, keeping wildcards

Question

<details>
<summary>English:</summary>


I have a dictionary like the following:

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}
```

I want to use expand in "rule all" to combine each key only with its own values, so that the expected output files are, e.g., "group1/a.txt", "group1/b.txt", ... "group2/x.txt", "group2/y.txt", ...

```python
rule all:
    input:
        expand("{group}/{sub_group}.txt", group = ???, sub_group = ???)
```

I need this for the rule "some_rule":

```python
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    params:
        group=group, # how do I extract these placeholders?
        sub_group=sub_group
    script:
        "some_script.R"
```

I need the group and sub_group wildcards because I have to pass them to the params of rule "some_rule".

I tried to hardcode all required output files in "rule all" with a list comprehension, but then the placeholders are not defined as wildcards and I cannot pass them to the params.

So I guess I need to define the "rule all" input files using expand, but I don't know how to get the correct files, because the combinations must be built separately for "group1" and its values and for "group2" and its values.

I also cannot use an input function for the rule "some_rule", as it has only a single static input file.

Other similar questions on StackOverflow either do not have this combinatorial problem, or they create the input files for "rule all" using plain Python, which makes me lose the wildcards.

Answer 1

Score: 2

Answer based on your comment.

```python
import pandas as pd

data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}

# One row per (group, value) pair, so both columns have the same length.
df = pd.DataFrame([(k, v) for k, vs in data.items() for v in vs],
                  columns=['Group', 'Value'])

rule all:
    input:
        # zip pairs the two columns element-wise instead of taking the full product.
        expand("{group}/{sub_group}.txt", zip, group=df['Group'], sub_group=df['Value'])

rule some_rule:
    output: "{group}/{sub_group}.txt"
    params:
        group='{group}',
        sub_group='{sub_group}'
    shell:
        """
        echo {params.group} {params.sub_group} > {output}
        """
```
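
The DataFrame in this answer only serves to flatten the dictionary into two parallel columns. As a minimal sketch of the same idea without pandas, the pairs can be built with a plain comprehension. The names `pairs`, `groups`, `sub_groups` and `targets` are illustrative and not part of the original answer. In a Snakefile, the two lists could presumably be passed as `expand("{group}/{sub_group}.txt", zip, group=groups, sub_group=sub_groups)`; the last line below only mimics that element-wise pairing in plain Python so the result can be checked outside Snakemake.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# One (group, sub_group) pair per target file; each key is combined
# only with its own values.
pairs = [(group, sub) for group, subs in data.items() for sub in subs]

# Two parallel lists of equal length, suitable for element-wise pairing.
groups = [g for g, _ in pairs]
sub_groups = [s for _, s in pairs]

# Plain-Python stand-in for the element-wise (zip-style) expansion.
targets = [f"{g}/{s}.txt" for g, s in zip(groups, sub_groups)]
print(targets)
```

Because the lists are paired position by position, no cross-group file such as "group1/x.txt" is produced.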

Answer 2

Score: 1

You can use this:

```python
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    script:
        "some_script.R"
```

and access the values of the wildcards {group} and {sub_group} inside the R script with, e.g., snakemake@wildcards[['group']] (not tested, but I think it should work).

Alternatively, I think you could have:

```python
params:
    group='{group}',
    sub_group='{sub_group}'
```

Answer 3

Score: 1

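I found a solution for my problem using a custom combinator function.

```python
import itertools

def pairwise_product(*args):
    result = []
    # Walk both wildcard lists in parallel: one group with its own list of sub-groups.
    for group, sub_group in zip(*args):
        # ("sub_group", ["a", "b"]) -> (["sub_group"], ["a", "b"]) so product() yields (name, value) pairs.
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result
```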

Looking at the source code for snakemake's expand function, I realized that I can use my own combinator function.

pairwise_product expects as input two lists of tuples, where each tuple contains the wildcard name and the wildcard value, e.g.

```python
wildcard1 = [("group", "group1"), ("group", "group2")]
wildcard2 = [("sub_group", ["a", "b", "c"]), ("sub_group", ["x", "y", "z"])]
pairwise_product(wildcard1, wildcard2)
```

The output of this function call would be:

```python
[(('group', 'group1'), ('sub_group', 'a')),
 (('group', 'group1'), ('sub_group', 'b')),
 (('group', 'group1'), ('sub_group', 'c')),
 (('group', 'group2'), ('sub_group', 'x')),
 (('group', 'group2'), ('sub_group', 'y')),
 (('group', 'group2'), ('sub_group', 'z'))]
```

And the output of the expand function would be:

```python
expand("{group}/{sub_group}.txt", pairwise_product, group=data.keys(), sub_group=data.values())
```

```python
['group1/a.txt',
 'group1/b.txt',
 'group1/c.txt',
 'group2/x.txt',
 'group2/y.txt',
 'group2/z.txt']
```
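
To try the combinator outside of a Snakefile, the following sketch pairs it with a small stand-in for expand. `expand_sketch` is a hypothetical helper written for this illustration; it only assumes the behaviour described in this answer (each keyword argument becomes a list of (name, value) tuples, the combinator receives those lists, and each resulting combination fills the pattern). It is not Snakemake's actual implementation.

```python
import itertools


def pairwise_product(*args):
    """Combine each group only with the sub-groups from its own list."""
    result = []
    for group, sub_group in zip(*args):
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result


def expand_sketch(pattern, combinator, **wildcards):
    # Hypothetical, simplified stand-in for expand (an assumption, not the real code):
    # turn every keyword argument into a list of (name, value) tuples ...
    named = [[(name, value) for value in values] for name, values in wildcards.items()]
    # ... and format the pattern once per combination the combinator returns.
    return [pattern.format(**dict(comb)) for comb in combinator(*named)]


data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

paths = expand_sketch(
    "{group}/{sub_group}.txt",
    pairwise_product,
    group=data.keys(),
    sub_group=data.values(),
)
print(paths)
```

The sketch relies on dictionaries preserving insertion order, so the keys and the value lists line up position by position.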

With this solution I also get the wildcards I want, i.e. the individual elements in the list-values for each dictionary key separately.

Note that this function has been designed for only two wildcards in the format as shown above in the data dictionary and not tested for other formats.

Answer 4

Score: 0

You can use nested list comprehensions:

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}

# Build one list of paths per key, then flatten the lists with sum(..., []).
files = sum(
    [
        [f"{key}/{value}.txt" for value in values] for key, values in data.items()
    ],
    []
)

print(files)
```
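
As a small variation on the snippet above, the same list can be built with a single comprehension that has two `for` clauses, which avoids flattening with `sum(..., [])`. This is only a sketch of an equivalent formulation; the name `flat_files` is illustrative.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# The outer loop runs over the keys, the inner loop over that key's own values.
flat_files = [
    f"{key}/{value}.txt"
    for key, values in data.items()
    for value in values
]

print(flat_files)
```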

I think you are planning to then run a program on each of the files? If so:

```python
for file in files:
    # run script on `file`
    pass  # placeholder so the loop is valid Python
```
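
Regarding the concern in the question that building the targets in plain Python loses the wildcards: as far as I know, Snakemake infers wildcard values by matching each requested file against a rule's output pattern, regardless of how the target list was created. The sketch below imitates that matching with a regular expression. `match_wildcards` is a hypothetical helper for illustration only, and it assumes that every wildcard matches one or more arbitrary characters.

```python
import re


def match_wildcards(pattern, path):
    """Return the wildcard values that make `pattern` equal `path`, or None."""
    # Splitting on "{name}" yields alternating literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


# A target created with a plain list comprehension still yields both values.
print(match_wildcards("{group}/{sub_group}.txt", "group2/y.txt"))
```

Under that assumption, a target such as "group2/y.txt" still gives "some_rule" its group and sub_group values, however the list in "rule all" was built.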

</details>



huangapple
  • Posted on March 31, 2023, 20:40:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75898649.html