英文:
Is there python way to match fragments within range
问题
我有一些基因被不同的reads覆盖。一个问题是在某些情况下,一个基因可能被多个reads覆盖。我希望捕获所有符合基因长度范围(第6列(基因起始)和第7列(基因终止))的reads片段,并添加一个带有基因名字、字母“a”以及连续发现的情况下我想要的基因名字与b的结束字段。
输入:
```none
contig 230 3888 read1 1 8397 Gene1
contig 3343 3885 read2 1 8397 Gene1
contig 6030 8257 read4 1 8397 Gene1
contig 6030 8257 read10 1 8397 Gene1
contig 6260 8257 read7 1 8397 Gene1
contig 6030 8257 read8 1 8397 Gene1
contig 6190 6345 read10 1 8397 Gene1
contig 0 3124 read27 26 406 Gene12
contig 100 200 read30 26 406 Gene12
期望输出:
contig 230 3888 read1 1 8397 Gene1 Gene1a
contig 3343 3885 read2 1 8397 Gene1 covered
contig 6030 8257 read4 1 8397 Gene1 Gene1b
contig 6030 8257 read10 1 8397 Gene1 covered
contig 6260 8257 read7 1 8397 Gene1 covered
contig 6030 8257 read8 1 8397 Gene1 covered
contig 6190 6345 read10 1 8397 Gene1 covered
contig 0 3124 read27 26 406 Gene12 Gene12a
contig 100 200 read30 26 406 Gene12 covered
代码:
打开输入文件,用“r”方式 as f_in,打开输出文件,用“w”方式 as f_out:
gene_dict = {}
对于f_in中的每一行:
fields = line.strip().split()
gene_name = fields[6]
read_start = int(fields[1])
read_end = int(fields[2])
gene_start = int(fields[4])
gene_end = int(fields[5])
如果基因名不在gene_dict中:
gene_dict[gene_name] = []
找到基因 = False
对于gene_dict[gene_name]中的每个基因片段:
frag_start = gene_frag[0]
frag_end = gene_frag[1]
如果read_start <= frag_start <= read_end 或者 read_start <= frag_end <= read_end:
找到基因 = True
退出循环
如果找到基因:
gene_dict[gene_name].append((gene_start, gene_end))
gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
fields.append(f"{gene_name}{gene_count}")
否则:
gene_dict[gene_name] = [(gene_start, gene_end)]
fields.append(f"{gene_name}")
f_out.write("\t".join(fields) + "\n")
英文:
I have a few genes that have been covered by different reads. One issue is that in certain cases where one gene is covered by multiple reads. I wish to catch all the fragments of reads that fit with gene length (range) [column 6 (genestart) and column 7 genestop] and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.
Input:
contig 230 3888 read1 1 8397 Gene1
contig 3343 3885 read2 1 8397 Gene1
contig 6030 8257 read4 1 8397 Gene1
contig 6030 8257 read10 1 8397 Gene1
contig 6260 8257 read7 1 8397 Gene1
contig 6030 8257 read8 1 8397 Gene1
contig 6190 6345 read10 1 8397 Gene1
contig 0 3124 read27 26 406 Gene12
contig 100 200 read30 26 406 Gene12
Desired output:
contig 230 3888 read1 1 8397 Gene1 Gene1a
contig 3343 3885 read2 1 8397 Gene1 covered
contig 6030 8257 read4 1 8397 Gene1 Gene1b
contig 6030 8257 read10 1 8397 Gene1 covered
contig 6260 8257 read7 1 8397 Gene1 covered
contig 6030 8257 read8 1 8397 Gene1 covered
contig 6190 6345 read10 1 8397 Gene1 covered
contig 0 3124 read27 26 406 Gene12 Gene12a
contig 100 200 read30 26 406 Gene12 covered
Code:
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
gene_dict = {}
for line in f_in:
fields = line.strip().split()
gene_name = fields[6]
read_start = int(fields[1])
read_end = int(fields[2])
gene_start = int(fields[4])
gene_end = int(fields[5])
if gene_name not in gene_dict:
gene_dict[gene_name] = []
gene_found = False
for gene_frag in gene_dict[gene_name]:
frag_start = gene_frag[0]
frag_end = gene_frag[1]
if read_start <= frag_start <= read_end or read_start <= frag_end <= read_end:
gene_found = True
break
if gene_found:
gene_dict[gene_name].append((gene_start, gene_end))
gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
fields.append(f"{gene_name}{gene_count}")
else:
gene_dict[gene_name] = [(gene_start, gene_end)]
fields.append(f"{gene_name}")
f_out.write("\t".join(fields) + "\n")
I am not getting the desired output. The output I got
contig 230 3888 read1 1 8397 Gene1 Gene1
contig 3343 3885 read2 1 8397 Gene1 Gene1
contig 6030 8257 read4 1 8397 Gene1 Gene1
contig 6030 8257 read10 1 8397 Gene1 Gene1
contig 6260 8257 read7 1 8397 Gene1 Gene1
contig 6030 8257 read8 1 8397 Gene1 Gene1
contig 6190 6345 read10 1 8397 Gene1 Gene1
contig 0 3124 read27 26 406 Gene12 Gene12
contig 100 200 read30 26 406 Gene12 Gene12
The file is tab-delimited although here I haven't constructed the example tab.
Please provide pointers on where I am going wrong.
答案1
得分: 1
I'm here to assist with the Chinese translation of the provided content. Please let me know which parts you'd like translated.
英文:
The below code has the desired output. I've taken the input as a multiline string and output via prints for ease on my end, so some adapting required.
I've tried to comment the steps, but do ask if not clear. I've used range()
and set()
logic to find overlaps.
data='''contig 230 3888 read1 1 8397 Gene1
contig 3343 3885 read2 1 8397 Gene1
contig 6030 8257 read4 1 8397 Gene1
contig 6030 8257 read10 1 8397 Gene1
contig 6260 8257 read7 1 8397 Gene1
contig 6030 8257 read8 1 8397 Gene1
contig 6190 6345 read10 1 8397 Gene1
contig 0 3124 read27 26 406 Gene12
contig 100 200 read30 26 406 Gene12 '''
#initialise variables
last_gene=''
gene_span=range(0)
total_read_span=range(0)
for contig in data.split('\n'): #iterate over contigs
fields=contig.split() #split lines into separate fields. Will need to be .split('\t) for tab-separated data
read_span=range(int(fields[1]),int(fields[2])+1)
if (gene := fields[6]) != last_gene: #check if new gene or previous gene
last_gene=gene
alphabet=iter('abcdefghijklmnopqrstuvqxyz') #set up an alphabet iterator to assign gene suffixes
gene_span=range(int(fields[4]),int(fields[5])+1)
total_read_span=range(0) #initialise total_read_span, i.e. the range of the sequence read
if len(set(read_span)&set(gene_span)): #check if read or gene spans overlap (the length of a set of intersecting values > 0)
if len(set(read_span)-set(total_read_span)): #check if the contig read has values outside of current total read span
print(f'{contig} {gene}{next(alphabet)}')
total_read_span=range(min(total_read_span.start,read_span.start),max(total_read_span.stop,read_span.stop)) #update total read span
else:
print(f'{contig} covered')
else:
print(f'{fields[3]} has no overlap with {gene}')
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论