SQL(ite)快速检索较大字符串(基因组)的多个子字符串

huangapple go评论71阅读模式
英文:

SQL(ite) fast retrieval of multiple substrings of larger string (genome)

问题

我有一个包含基因组、染色体和“有趣”区域(以BED格式表示)的SQL数据库。一个基因组(例如4GB)由+/- 20个染色体组成,所以每个字符串大约有200MB。例如,基因组中的染色体由以下字符串组成:

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB大

假设我有大约1,000,000个ATAC-seq峰值区域,我想要获取仅位于染色体3上的100个碱基对的序列。我的SQL查询看起来像这样:

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'

这种查询的问题在于,对于每个单独的匹配,都会加载Chr.Sequence,使RAM使用过于庞大,查询变得非常缓慢。我“修复”这个问题的方式是仅使用SQL数据库存储有趣位置的位置,然后使用pyfaidx快速从染色体中获取相应的序列。

我想知道是否可能在SQL(ite)中进行快速查找,因为当前的解决方案对我来说有点临时。

英文:

I have a sql database consisting of a genome, its chromosomes, and "interesting" regions (in BED format). A genome (e.g. 4GB) consists of +/- 20 chromosomes, so each string is around 200MB large. The chromosome in a genome for instance consists of the string:

NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB large

Let's say I have around 1.000.000 regions ATAC-seq peaks and I would like to get the sequences of 100 bp of only chromosome 3. My SQL query would look like this:

SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'

The problem with these kind of lookups is that the Chr.Sequence is loaded for every single hit, making the ram usage excessively large, and the lookup really slow. The way I 'fixed' this is only using the SQL database to store the positions of the interesting locations, and then using pyfaidx to quickly get the corresponding sequences out of the chromosome.

I am wondering if a fast lookup is possible in SQL(ite), since the current solution feels a bit ad-hoc to me.

答案1

得分: 2

你可以将基因组字符串分成多个部分,同时使用多进程搜索子字符串,以减少单个处理器的使用并加快结果速度。

import sys
import multiprocessing
from multiprocessing import Pool

def SplitGenomeString(start, length):
    # 在全局变量中使用数据库连接
    cursorObj = con.cursor()

    cursorObj.execute('SELECT substr(Chr.Sequence, {}, {}) FROM Chromosome Chr'.format(str(start), str(length)))
    return cursorObj.fetchall()

def getSubSequence(s):
    # 根据需要编写查询以查找子序列 s

if __name__ == '__main__':
    length = SplitGenomeString(0, sys.maxsize)
    cores = multiprocessing.cpu_count()

    # 假设要检查的子序列长度为 100
    subseq_len = 100

    ls = []
    for i in range(0, length, round(length / cores)):
        ls.append(SplitGenomeString(i, round(length / cores)))

    # 这将包括由于拆分基因组字符串而被排除的部分
    temp = []
    for i in range(len(ls) - 1):
        temp.append(ls[i][1 - subseq_len:] + ls[i + 1][:subseq_len - 1])

    ls = ls + temp

    with Pool(cores) as p:
        p.map(getSubSequence, ls)

你可以在这个链接中了解有关多进程的更多信息:https://docs.python.org/3/library/multiprocessing.html

英文:

You can break genome string into parts and search for substrings simultaneously using multiprocessing to minimize the single processor usage and faster results.

import sys
import multiprocessing
from multiprocessing import Pool

def SplitGenomeString(start,length):
#where con in sqlconnection to database using as global variable
   cursorObj = con.cursor()


   cursorObj.execute('SELECT substr(Chr.Sequence,{},{}) FROM Chromosome 
                      Chr'.format(str(start),str(length))
   return cursorObj.fetchall()

def getSubSequence(s):
#Write your Queries according to your requirements for finding subsequence s




if __name__ == '__main__':
    length = SplitGenomeString(0,sys.maxint)
    cores = multiprocessing.cpu_count()

    #asumming the subsequence you want to check is 100
    subseq_len = 100

    for i in range(0,length,round(length/cores)):
        ls.append(SplitGenomeString(i,round(length/cores)))

    #this will also include the excluded parts because of splitting geneome string
    temp = []
    for i in range(len(ls) - 1):
        temp.append(ls[i][1 - subseq_len):] + ls[i+1][:subseq_len - 1])

    ls = ls + temp

    with Pool(cores) as p:
        p.map(getSubSequence,ls)

You can see about multiprocessing in this link:
https://docs.python.org/3/library/multiprocessing.html

Edited as told

huangapple
  • 本文由 发表于 2020年1月6日 17:33:13
  • 转载请务必保留本文链接:https://go.coder-hub.com/59609641.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定