How to approach workflows sharing some input data in Snakemake?
Question
I have multiple Snakemake workflows, and some of them share rule outputs, for example a rule that builds bwa/bowtie2 indexes. How should I manage this situation? Recently I found between-workflow caching; however, a note at the bottom of its documentation says that it is an experimental implementation, and more information about it is pretty scarce. My question is: are there any caveats or unsolved problems with the mentioned cache implementation? Or are there other ways to save resources?
I plan to use it in an HPC setting, running multiple workflows, potentially while other users run instances of the same workflows. Thanks.
I tried two approaches. At first I kept the workflows completely separate and independent, but in some cases (the human genome, for example) this became a huge waste of time and disk space. Then I tried keeping the shared outputs in a separate directory and linking them into the workflows. This soon proved error-prone and hard to manage, because the workflows could use different tool versions or different parameters.
Answer 1
Score: 2
It's doubtful that there is a single solution for this pattern. This is analogous to situations in regular software development that require code repetition. The two extreme solutions are:
- isolate the common parts into a separate workflow that becomes a dependency of the other workflows;
- keep the common parts in each workflow, but protect the overlapping parts using the ancient flag to avoid time-based recalculation.
While the first option might be tempting, just like strict DRY adherence, it might hurt workflow flexibility down the road, for example if adjustments made to the shared workflow break compatibility with the downstream workflows.
The second option might seem inefficient; however, it can be better if circumstances require these workflows to be developed independently, since little or no effort is needed to synchronize changes across them.
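As a rough illustration of the two options, here is a minimal sketch of what each could look like in a downstream workflow's Snakefile. All paths, rule names, and the module name shared_index are hypothetical, and the two fragments are alternatives rather than parts of one file. The first assumes Snakemake 6.0 or newer, where the module and use rule statements are available:

```snakemake
# Option 1 (sketch): pull the shared indexing rules in from a separate workflow.
# The snakefile path and the module name are assumptions for illustration.
module shared_index:
    snakefile:
        "../shared-index/workflow/Snakefile"
    config:
        config


# Import every rule of the module, prefixing names to avoid clashes.
use rule * from shared_index as shared_*
```

The second fragment keeps the workflows independent and only marks the already-built index as ancient, so that a newer modification time on the index does not by itself cause the alignment to be redone:

```snakemake
# Option 2 (sketch): consume an existing index and ignore its timestamp.
# File names are hypothetical; bwa index writes genome.fa.bwt among other files.
rule bwa_mem:
    input:
        idx=ancient("resources/genome.fa.bwt"),
        reads="reads/{sample}.fastq.gz",
    output:
        "mapped/{sample}.sam",
    params:
        prefix="resources/genome.fa",
    shell:
        "bwa mem {params.prefix} {input.reads} > {output}"
```

Note that ancient only affects the timestamp comparison for that input. It does not check that the index was built with the tool version or parameters a given workflow expects, so that part still has to be managed separately.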