2020年1月6日 17:33:13go评论106阅读模式

英文:

SQL(ite) fast retrieval of multiple substrings of larger string (genome)

问题

我有一个包含基因组、染色体和“有趣”区域（以BED格式表示）的SQL数据库。一个基因组（例如4GB）由+/- 20个染色体组成，所以每个字符串大约有200MB。例如，基因组中的染色体由以下字符串组成：

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB大

假设我有大约1,000,000个ATAC-seq峰值区域，我想要获取仅位于染色体3上的100个碱基对的序列。我的SQL查询看起来像这样：

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'

这种查询的问题在于，对于每个单独的匹配，都会加载Chr.Sequence，使RAM使用过于庞大，查询变得非常缓慢。我“修复”这个问题的方式是仅使用SQL数据库存储有趣位置的位置，然后使用pyfaidx快速从染色体中获取相应的序列。

我想知道是否可能在SQL（ite）中进行快速查找，因为当前的解决方案对我来说有点临时。

英文:

I have a sql database consisting of a genome, its chromosomes, and "interesting" regions (in BED format). A genome (e.g. 4GB) consists of +/- 20 chromosomes, so each string is around 200MB large. The chromosome in a genome for instance consists of the string:

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB large

Let's say I have around 1.000.000 regions ATAC-seq peaks and I would like to get the sequences of 100 bp of only chromosome 3. My SQL query would look like this:

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = &#39;chr3&#39;

The problem with these kind of lookups is that the Chr.Sequence is loaded for every single hit, making the ram usage excessively large, and the lookup really slow. The way I 'fixed' this is only using the SQL database to store the positions of the interesting locations, and then using pyfaidx to quickly get the corresponding sequences out of the chromosome.

I am wondering if a fast lookup is possible in SQL(ite), since the current solution feels a bit ad-hoc to me.

答案1

得分: 2

你可以将基因组字符串分成多个部分，同时使用多进程搜索子字符串，以减少单个处理器的使用并加快结果速度。

import sys
import multiprocessing
from multiprocessing import Pool
def SplitGenomeString(start, length):
    # 在全局变量中使用数据库连接
    cursorObj = con.cursor()
    cursorObj.execute('SELECT substr(Chr.Sequence, {}, {}) FROM Chromosome Chr'.format(str(start), str(length)))
    return cursorObj.fetchall()
def getSubSequence(s):
    # 根据需要编写查询以查找子序列 s
if __name__ == '__main__':
    length = SplitGenomeString(0, sys.maxsize)
    cores = multiprocessing.cpu_count()
    # 假设要检查的子序列长度为 100
    subseq_len = 100
    ls = []
    for i in range(0, length, round(length / cores)):
        ls.append(SplitGenomeString(i, round(length / cores)))
    # 这将包括由于拆分基因组字符串而被排除的部分
    temp = []
    for i in range(len(ls) - 1):
        temp.append(ls[i][1 - subseq_len:] + ls[i + 1][:subseq_len - 1])
    ls = ls + temp
    with Pool(cores) as p:
        p.map(getSubSequence, ls)

你可以在这个链接中了解有关多进程的更多信息：https://docs.python.org/3/library/multiprocessing.html

英文:

You can break genome string into parts and search for substrings simultaneously using multiprocessing to minimize the single processor usage and faster results.

import sys
import multiprocessing
from multiprocessing import Pool
def SplitGenomeString(start,length):
#where con in sqlconnection to database using as global variable
   cursorObj = con.cursor()
   cursorObj.execute(&#39;SELECT substr(Chr.Sequence,{},{}) FROM Chromosome 
                      Chr&#39;.format(str(start),str(length))
   return cursorObj.fetchall()
def getSubSequence(s):
#Write your Queries according to your requirements for finding subsequence s
if __name__ == &#39;__main__&#39;:
    length = SplitGenomeString(0,sys.maxint)
    cores = multiprocessing.cpu_count()
    #asumming the subsequence you want to check is 100
    subseq_len = 100
    for i in range(0,length,round(length/cores)):
        ls.append(SplitGenomeString(i,round(length/cores)))
    #this will also include the excluded parts because of splitting geneome string
    temp = []
    for i in range(len(ls) - 1):
        temp.append(ls[i][1 - subseq_len):] + ls[i+1][:subseq_len - 1])
    ls = ls + temp
    with Pool(cores) as p:
        p.map(getSubSequence,ls)

You can see about multiprocessing in this link:
https://docs.python.org/3/library/multiprocessing.html

Edited as told

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

SQL(ite)快速检索较大字符串（基因组）的多个子字符串

问题

答案1

在QML地图小部件中使用不同图像作为标记的多个标记。

使用Selenium进行网页爬取时加载更多

列表操作与索引

使用Graphene进行GraphQL多个查询和变更

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。