How to correctly define multiple sets of values as config values in snakemake
Question
I have a Snakefile that loops over a set of isolates and runs two Python scripts for each. I want to pass several sets of values as the script flags, but I do not know how to do this without a Bash script.
    import os

    grouphome = os.environ['GROUPHOME']
    ISOLATES = [i for i in open(grouphome+'/isolates.txt').read().split('\n') if len(i) > 0]
    window_size=config["window_size"]
    slide_size=config["slide_size"]
    mean_filter_size =config["mean_filter_size"]

    rule all:
        input:
            expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv",
                   isolate=ISOLATES,
                   window_size=config["window_size"],
                   slide_size=config["slide_size"],
                   mean_filter_size=config["mean_filter_size"]),
            expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv",
                   isolate=ISOLATES,
                   window_size=config["window_size"],
                   slide_size=config["slide_size"],
                   mean_filter_size=config["mean_filter_size"]),
            expand(grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv",
                   isolate=ISOLATES,
                   window_size=config["window_size"],
                   slide_size=config["slide_size"],
                   mean_filter_size=config["mean_filter_size"])

    rule coverage_preprocessing:
        input:
            input1= grouphome+"/data/{isolate}.fasta",
            input2= grouphome+"/max_coverage/{isolate}_maximum_coverage.csv"
        params:
            window_size=config["window_size"],
            slide_size=config["slide_size"],
            mean_filter_size =config["mean_filter_size"]
        output: grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
        shell:
            """./preprocess_coverage.py -i {input.input2} -g {input.input1} -sw {params.slide_size} -ws {params.window_size} -m mean -bm mean -bs {params.mean_filter_size} -o {output}"""

    rule calc_bptr:
        params:
            window_size=config["window_size"],
            slide_size=config["slide_size"],
            mean_filter_size =config["mean_filter_size"]
        input:
            input1= grouphome+"/data/{isolate}.fasta",
            input2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/{isolate}.csv"
        output:
            output1= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/ptr.csv",
            output2= grouphome+"/preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}/ptr/{isolate}/residuals_fitted_values.csv",
        shell:
            """./calculate_bptr.py -i {input.input2} -b mean -s {params.mean_filter_size} -g {input.input1} -f {output.output2} -o {output.output1}"""
I am using the following command to run it:
    snakemake --config window_size=1000000 slide_size=10000 mean_filter_size=99 --cores=9
How do I set multiple sets of values for --config? For example, (1000000, 10000, 99), then (1000000, 100, 66), and so on.
Answer 1
Score: 1
There are multiple layers to this problem.
First, I would prefer passing a config file to the snakemake call, e.g. snakemake --configfile config.yaml, with config.yaml containing the values for the parameters. There, you can specify a list of values for each parameter, such as:
    window_size:
      - 1000
      - 2000
    slide_size:
      - 5
      - 10
The second problem is how to run a rule with a list of parameter values. I would suggest requesting the outputs for all parameter values in rule all, with the downstream rules taking their parameter values from wildcards.
    # This rule requests all combinations of the two parameters,
    # which are passed as lists from the config.
    rule all:
        input:
            expand(
                "results/{window_size}_{slide_size}.csv",
                window_size=config["window_size"],
                slide_size=config["slide_size"]
            ),

    # This rule parses the parameter values from the output path;
    # they are accessed through the wildcards object.
    rule preprocess_coverage:
        input:
            ...,
        output:
            "results/{window_size}_{slide_size}.csv"
        shell:
            "... -sw {wildcards.slide_size} -ws {wildcards.window_size} -o {output}"
Using the example config.yaml from above, the first rule (rule all) requests the following files: results/1000_5.csv, results/1000_10.csv, results/2000_5.csv and results/2000_10.csv. Snakemake recognizes that the rule preprocess_coverage can produce these files by setting the wildcard values, which are then accessed in the shell directive of the rule as parameters for the script.
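To see which files rule all ends up requesting, the all-combinations behaviour of expand can be imitated in plain Python with itertools.product. This is only a sketch of the idea, not Snakemake's own code: expand_product is a hypothetical helper name, and the config dict stands in for the parsed example config.yaml.

```python
from itertools import product

# Hypothetical stand-in for the parsed example config.yaml.
config = {"window_size": [1000, 2000], "slide_size": [5, 10]}


def expand_product(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


targets = expand_product(
    "results/{window_size}_{slide_size}.csv",
    window_size=config["window_size"],
    slide_size=config["slide_size"],
)
print(targets)
```

Two values for each of two parameters give four target files, one per combination.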
See the Snakemake documentation for more about wildcards and about the expand function.
You can pass a combinatorial function to expand to control how the values are combined. For example, to produce only results/1000_5.csv and results/2000_10.csv, use the zip function:
    rule all:
        input:
            expand(
                "results/{window_size}_{slide_size}.csv",
                zip,
                window_size=config["window_size"],
                slide_size=config["slide_size"]
            ),
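Applied to the three parameters from the question, the zip variant pairs the i-th entries of each list, so each position describes one set of values. A plain-Python sketch of that pairing follows; expand_zip is a hypothetical helper, not part of Snakemake, and the lists hold the example values from the question.

```python
# Example values from the question: position i of each list forms one set.
config = {
    "window_size": [1000000, 1000000],
    "slide_size": [10000, 100],
    "mean_filter_size": [99, 66],
}


def expand_zip(pattern, **wildcards):
    """Fill `pattern` once per position, pairing values the way zip() does."""
    names = list(wildcards)
    rows = zip(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, row))) for row in rows]


dirs = expand_zip(
    "preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}",
    **config,
)
print(dirs)
```

Note that zip pairs every list passed to it, so a list of isolates would have to be combined with these parameter sets in a separate step rather than zipped alongside them.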
With Snakemake 5.31 and above, there is another option: parameter space exploration, which is well described in the Snakemake manual. Paramspace simplifies the above approach; it works the same way with less code. I would recommend trying the above first, since you need to understand it to understand how Paramspace works.
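The table-driven idea behind parameter space exploration can be sketched in plain Python: each row of a small table is one set of values, and each row maps to one output path. This sketch does not use Snakemake's Paramspace class; the inline table, the rows_to_paths helper and the path layout are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical parameter table; each row is one set of values to run.
PARAMS_TSV = (
    "window_size\tslide_size\tmean_filter_size\n"
    "1000000\t10000\t99\n"
    "1000000\t100\t66\n"
)


def rows_to_paths(tsv_text, pattern):
    """Return one formatted path per table row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [pattern.format(**row) for row in reader]


paths = rows_to_paths(
    PARAMS_TSV,
    "preprocessed_coverage_{window_size}_{slide_size}_{mean_filter_size}",
)
print(paths)
```

Keeping the sets in a table means a new run is added by appending a row, with no change to the workflow code.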