SQL(ite) fast retrieval of multiple substrings of larger string (genome)
Question
I have a SQL database containing a genome, its chromosomes, and "interesting" regions (in BED format). A genome (e.g. 4 GB) consists of roughly 20 chromosomes, so each chromosome string is around 200 MB. A chromosome in the genome looks, for instance, like this:
NNNNATCCAGGAGAATTACAT...ACCGGGAATTCCCGGNNNNN # 200 MB large
Let's say I have around 1,000,000 ATAC-seq peak regions, and I would like to get the 100 bp sequence around each peak, for chromosome 3 only. My SQL query would look like this:
SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - 50, 100) FROM Bed Bed
INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
WHERE Chr.Chromosome = 'chr3'
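To make the coordinate arithmetic of this query concrete, here is a minimal sketch that runs it against a toy in-memory SQLite database. The table layout (column names beyond those in the query), the 400-character toy sequence, and the helper names `fetch_peak_windows` and `build_toy_db` are assumptions for illustration only. Note that SQLite's `substr` is 1-based, so the start value `ChromStart + Peak - 50` refers to Python index `ChromStart + Peak - 51`.

```python
import sqlite3


def fetch_peak_windows(con, chromosome, flank=50):
    """Run the query from the question; one (2 * flank)-character window per BED row."""
    query = """
        SELECT substr(Chr.Sequence, Bed.ChromStart + Bed.Peak - ?, ?)
        FROM Bed
        INNER JOIN Chromosome Chr ON Bed.ChromosomeId = Chr.ChromosomeId
        WHERE Chr.Chromosome = ?
        ORDER BY Bed.ChromStart
    """
    return [row[0] for row in con.execute(query, (flank, 2 * flank, chromosome))]


def build_toy_db():
    """Create a tiny stand-in for the real database (assumed schema)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Chromosome ("
        "ChromosomeId INTEGER PRIMARY KEY, Chromosome TEXT, Sequence TEXT)"
    )
    con.execute(
        "CREATE TABLE Bed (ChromosomeId INTEGER, ChromStart INTEGER, Peak INTEGER)"
    )
    toy_sequence = "ACGT" * 100  # 400 characters instead of 200 MB
    con.execute("INSERT INTO Chromosome VALUES (1, 'chr3', ?)", (toy_sequence,))
    con.executemany("INSERT INTO Bed VALUES (1, ?, ?)", [(100, 20), (200, 35)])
    return con, toy_sequence


con, toy_sequence = build_toy_db()
windows = fetch_peak_windows(con, "chr3")
```

On a toy sequence this returns immediately; the slowdown described in the question only shows up when `Chr.Sequence` is hundreds of megabytes.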
The problem with this kind of lookup is that Chr.Sequence is loaded for every single hit, which makes the RAM usage excessively large and the lookup really slow. The way I 'fixed' this is by using the SQL database only to store the positions of the interesting locations, and then using pyfaidx to quickly get the corresponding sequences out of the chromosome.
I am wondering if a fast lookup is possible in SQL(ite), since the current solution feels a bit ad-hoc to me.
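The workaround described above can be sketched roughly as follows: positions come from SQLite, and the slicing happens outside the database. This is only an illustration under assumptions. The simplified `Bed` table (with a `Chromosome` column), the helper name `peak_windows`, and the toy data are hypothetical, and a plain dict of Python strings stands in for the indexed FASTA lookup; in the real setup, a `pyfaidx.Fasta` object, which is likewise indexed by sequence name and then sliced, would take its place. `ChromStart + Peak` is treated here as a 0-based summit position, following the BED convention.

```python
import sqlite3


def peak_windows(con, sequences, chromosome, flank=50):
    """Yield (start, window) per peak; `start` is a 0-based index into the sequence."""
    rows = con.execute(
        "SELECT ChromStart + Peak FROM Bed "
        "WHERE Chromosome = ? ORDER BY ChromStart + Peak",
        (chromosome,),
    )
    seq = sequences[chromosome]  # stand-in for an indexed FASTA lookup
    for (summit,) in rows:
        start = max(summit - flank, 0)  # clamp at the chromosome start
        yield start, seq[start:start + 2 * flank]


# Toy data: only positions live in SQLite, sequences live elsewhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Bed (Chromosome TEXT, ChromStart INTEGER, Peak INTEGER)")
con.executemany(
    "INSERT INTO Bed VALUES (?, ?, ?)",
    [("chr3", 100, 20), ("chr3", 200, 35), ("chr1", 10, 5)],
)
sequences = {"chr3": "ACGT" * 100, "chr1": "TTGA" * 100}

result = list(peak_windows(con, sequences, "chr3"))
```

The database query touches only small integer columns here, so memory use no longer depends on the size of the chromosome string.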
Answer 1
Score: 2
You can break the genome string into parts and search them for substrings in parallel using multiprocessing, which spreads the work across cores instead of loading a single processor and returns results faster.
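import multiprocessing
import sqlite3
from multiprocessing import Pool

DB_PATH = 'genome.db'  # assumption: path to your SQLite file
CHROMOSOME = 'chr3'    # chromosome name taken from the question

def SequenceLength():
    # con is the sqlite3 connection to the database, used as a global variable
    cursorObj = con.cursor()
    cursorObj.execute('SELECT length(Sequence) FROM Chromosome WHERE Chromosome = ?',
                      (CHROMOSOME,))
    return cursorObj.fetchone()[0]

def SplitGenomeString(start, length):
    # substr() in SQLite is 1-based, so start must be >= 1
    cursorObj = con.cursor()
    cursorObj.execute('SELECT substr(Sequence, ?, ?) FROM Chromosome WHERE Chromosome = ?',
                      (start, length, CHROMOSOME))
    return cursorObj.fetchone()[0]

def getSubSequence(s):
    # Write your own search logic for the chunk s here;
    # this placeholder just returns the chunk unchanged
    return s

if __name__ == '__main__':
    con = sqlite3.connect(DB_PATH)
    length = SequenceLength()
    cores = multiprocessing.cpu_count()
    # assuming the subsequence you want to check is 100 long
    subseq_len = 100
    # ceiling division, so at most `cores` chunks cover the whole sequence
    chunk = max(1, -(-length // cores))
    ls = []
    for i in range(0, length, chunk):
        ls.append(SplitGenomeString(i + 1, chunk))
    # also include the windows that straddle a chunk boundary,
    # which would otherwise be lost by splitting the genome string
    temp = []
    for i in range(len(ls) - 1):
        temp.append(ls[i][1 - subseq_len:] + ls[i + 1][:subseq_len - 1])
    ls = ls + temp
    with Pool(cores) as p:
        results = p.map(getSubSequence, ls)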
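The boundary handling in the code above can be checked on a plain Python string, without a database. The sketch below is an illustration under assumptions: the function name `split_with_overlaps` and the toy sequence are made up, and it assumes each chunk is at least `subseq_len - 1` characters long, so that a window never spans more than two chunks.

```python
def split_with_overlaps(sequence, parts, subseq_len):
    """Split `sequence` into about `parts` chunks plus one overlap piece per boundary."""
    chunk = max(1, -(-len(sequence) // parts))  # ceiling division
    chunks = [sequence[i:i + chunk] for i in range(0, len(sequence), chunk)]
    # Each overlap joins the last (subseq_len - 1) characters of one chunk
    # with the first (subseq_len - 1) characters of the next chunk.
    overlaps = [
        chunks[i][1 - subseq_len:] + chunks[i + 1][:subseq_len - 1]
        for i in range(len(chunks) - 1)
    ]
    return chunks, overlaps


# A 100-character toy sequence whose motif straddles the split at index 50.
motif = "GATTACA"
sequence = "A" * 48 + motif + "A" * 45
chunks, overlaps = split_with_overlaps(sequence, parts=2, subseq_len=len(motif))

found_in_chunks = any(motif in c for c in chunks)
found_in_overlaps = any(motif in o for o in overlaps)
```

Here the motif is cut in two by the split, so neither chunk contains it, while the overlap piece does. A match that lies close to a boundary but fully inside one chunk can show up in both a chunk and an overlap piece, so results may need de-duplicating.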
You can read more about multiprocessing here: https://docs.python.org/3/library/multiprocessing.html
Edited as told