March 31, 2023, 20:40:55

Snakemake expand on a dictionary, keeping wildcards

Question

<details>
<summary>English:</summary>
I have a dictionary like the following:

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}
```

I want to use expand in "rule all" to combine each key only with its own values, so that the expected output files are e.g. "group1/a.txt", "group1/b.txt", ... "group2/x.txt", "group2/y.txt", ...

```python
rule all:
    input:
        expand("{group}/{sub_group}.txt", group = ???, sub_group = ???)
```

I need this for the rule "some_rule":

```python
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    params:
        group=group, # how do I extract these placeholders?
        sub_group=sub_group
    script:
        "some_script.R"
```

I need the group and sub_group wildcards because I have to pass them to the params of rule "some_rule".

I tried to hardcode all output files needed in "rule all" with a list comprehension, but then the placeholders are not defined in the wildcards and I cannot pass them to the params.

So I guess I need to define the "rule all" input files using expand, but I don't know how to get the correct files, because the combinations have to be formed separately between "group1" and its values and between "group2" and its values.

I also cannot use an input function for the rule "some_rule", as it has only a single static input file.

In other similar questions on StackOverflow, either the combinatorial problem does not arise, or they create the input files for "rule all" using plain Python, which makes me lose the wildcards.

Answer 1

Score: 2

Answer based on your comment.

```python
import pandas as pd

data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}

# Flatten the dictionary into one (group, value) row per output file.
df = pd.DataFrame([(k, v) for k, vs in data.items() for v in vs],
                  columns=['Group', 'Value'])

rule all:
    input:
        # zip pairs the two columns row by row instead of forming the full product.
        expand("{group}/{sub_group}.txt", zip, group=df['Group'], sub_group=df['Value'])

rule some_rule:
    output: "{group}/{sub_group}.txt"
    params:
        group='{group}',
        sub_group='{sub_group}'
    shell:
        """
        echo {params.group} {params.sub_group} > {output}
        """
```
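
The two parallel columns used above can also be built without pandas. The following is a minimal plain-Python sketch; the names `pairs`, `groups`, `sub_groups`, and `targets` are illustrative, and the last comprehension only imitates what a zipped expansion is expected to produce rather than calling Snakemake itself.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# One (key, value) pair per desired output file.
pairs = [(group, sub) for group, subs in data.items() for sub in subs]

# Two parallel lists of equal length, suitable for a zipped expansion.
groups = [group for group, _ in pairs]
sub_groups = [sub for _, sub in pairs]

# Plain-Python stand-in for the zipped expansion (not Snakemake's expand).
targets = [f"{g}/{s}.txt" for g, s in zip(groups, sub_groups)]
print(targets)
```

In a Snakefile, `groups` and `sub_groups` could then take the place of the two DataFrame columns in the `expand(..., zip, ...)` call.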

Answer 2

Score: 1

You can use this:

```python
rule some_rule:
    input: "single_input_file.txt"
    output: "{group}/{sub_group}.txt"
    script:
        "some_script.R"
```

and access the values of the wildcards {group} and {sub_group} inside the R script with e.g. `snakemake@wildcards[['group']]` (not tested, but I think it should work).

Alternatively, I think you could have:

```python
params:
    group='{group}',
    sub_group='{sub_group}'
```
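
Both options rely on the same mechanism: Snakemake fills in the wildcards by matching a requested file name against the rule's output pattern, however the list of targets was produced. The sketch below imitates that matching with a regular expression. It is a simplified model, not Snakemake's implementation; `pattern_to_regex` and `match_wildcards` are hypothetical helpers, and letting every wildcard match `.+` is an assumption made for this illustration.

```python
import re


def pattern_to_regex(pattern):
    """Turn "{group}/{sub_group}.txt" into a regex with named groups."""
    # Splitting on "{name}" yields literal text at even indices
    # and wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def match_wildcards(pattern, filename):
    """Return the wildcard values for filename, or None if it does not match."""
    m = pattern_to_regex(pattern).fullmatch(filename)
    return m.groupdict() if m else None


print(match_wildcards("{group}/{sub_group}.txt", "group1/a.txt"))
```

Under this model, a target such as "group1/a.txt" yields a `group` and a `sub_group` value even if the target list itself was written out by hand.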

Answer 3

Score: 1

I found a solution for my problem using a custom combinator function.

```python
import itertools

def pairwise_product(*args):
    result = []
    # Pair the i-th group entry with the i-th sub_group entry.
    for group, sub_group in zip(*args):
        # sub_group is (name, list_of_values); wrap the name so product can iterate it.
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result
```

Looking at the source code for snakemake's expand function, I realized that I can use my own combinator function.

pairwise_product expects two lists of tuples as input, where each tuple contains the wildcard name and the wildcard value, e.g.

```python
wildcard1 = [("group", "group1"), ("group", "group2")]
wildcard2 = [("sub_group", ["a", "b", "c"]), ("sub_group", ["x", "y", "z"])]
pairwise_product(wildcard1, wildcard2)
```

The output of this function call would be:

```python
[(('group', 'group1'), ('sub_group', 'a')),
 (('group', 'group1'), ('sub_group', 'b')),
 (('group', 'group1'), ('sub_group', 'c')),
 (('group', 'group2'), ('sub_group', 'x')),
 (('group', 'group2'), ('sub_group', 'y')),
 (('group', 'group2'), ('sub_group', 'z'))]
```

And the output of the expand function would be:

```python
expand("{group}/{sub_group}.txt", pairwise_product, group=data.keys(), sub_group=data.values())
['group1/a.txt',
 'group1/b.txt',
 'group1/c.txt',
 'group2/x.txt',
 'group2/y.txt',
 'group2/z.txt']
```

With this solution I also get the wildcards I want, i.e. the individual elements of the list values, separately for each dictionary key.

Note that this function has been designed only for two wildcards in the format of the data dictionary shown above and has not been tested with other formats.
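
To see how such a combinator plugs into an expansion without running Snakemake, the sketch below uses a deliberately simplified stand-in. `mini_expand` is a hypothetical helper written for this illustration; it only assumes, as described above, that the combinator receives one list of (name, value) tuples per wildcard and returns combinations of such tuples. It is not Snakemake's actual `expand`.

```python
import itertools


def pairwise_product(*args):
    """Combinator from the answer: pair each group with its own sub-groups."""
    result = []
    for group, sub_group in zip(*args):
        sub_group = ([sub_group[0]], sub_group[1])
        for sub_sub_group in itertools.product(*sub_group):
            result.append((group, sub_sub_group))
    return result


def mini_expand(pattern, combinator, **wildcards):
    """Hypothetical, simplified stand-in for an expand-like function."""
    # One list of (name, value) tuples per wildcard.
    flattened = [
        [(name, value) for value in values]
        for name, values in wildcards.items()
    ]
    # Each combination is a sequence of (name, value) pairs.
    return [pattern.format(**dict(comb)) for comb in combinator(*flattened)]


data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

files = mini_expand(
    "{group}/{sub_group}.txt",
    pairwise_product,
    group=data.keys(),
    sub_group=data.values(),
)
print(files)
```

The stand-in makes the data flow visible: each value of `sub_group` is a whole list at the point where the combinator sees it, which is why `pairwise_product` unpacks it with `itertools.product`.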

Answer 4

Score: 0

You can use nested list comprehensions:

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"]
}
# Build one list of paths per key, then flatten the lists with sum(..., []).
files = sum(
    [
        [f"{key}/{value}.txt" for value in values] for key, values in data.items()
    ],
    []
)
print(files)
```

I think you are planning to then run a program on each of the files? If so:

```python
for file in files:
    # run script on `file`
    pass
```
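
As a small stylistic variation, the same list can be produced by a single comprehension with two `for` clauses, which avoids the `sum(..., [])` flattening step. This is only a sketch; the name `flat_files` is illustrative.

```python
data = {
    "group1": ["a", "b", "c"],
    "group2": ["x", "y", "z"],
}

# The outer loop runs over the keys, the inner loop over each key's own values.
flat_files = [
    f"{key}/{value}.txt"
    for key, values in data.items()
    for value in values
]
print(flat_files)
```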
</details>
