有没有Python的方法来匹配范围内的片段

huangapple go评论68阅读模式
英文:

Is there python way to match fragments within range

问题

我有一些基因被不同的reads覆盖。一个问题是在某些情况下,一个基因可能被多个reads覆盖。我希望捕获所有符合基因长度范围(第6列(基因起始)和第7列(基因终止))的reads片段,并添加一个带有基因名字、字母“a”以及连续发现的情况下我想要的基因名字与b的结束字段。

输入:
```none
contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12

期望输出:

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

代码:

打开输入文件,用“r”方式 as f_in,打开输出文件,用“w”方式 as f_out:
    gene_dict = {}
    对于f_in中的每一行:
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        如果基因名不在gene_dict中:
            gene_dict[gene_name] = []
        找到基因 = False
        对于gene_dict[gene_name]中的每个基因片段:
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            如果read_start <= frag_start <= read_end 或者 read_start <= frag_end <= read_end:
                找到基因 = True
                退出循环
        如果找到基因:
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
            fields.append(f"{gene_name}{gene_count}")
        否则:
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f"{gene_name}")
        f_out.write("\t".join(fields) + "\n")
英文:

I have a few genes that have been covered by different reads. One issue is that in certain cases where one gene is covered by multiple reads. I wish to catch all the fragments of reads that fit with gene length (range) [column 6 (genestart) and column 7 genestop] and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.

Input:

contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12 

Desired output:

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

Code:

with open(input_file, &quot;r&quot;) as f_in, open(output_file, &quot;w&quot;) as f_out:
    gene_dict = {}
    for line in f_in:
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        if gene_name not in gene_dict:
            gene_dict[gene_name] = []
        gene_found = False
        for gene_frag in gene_dict[gene_name]:
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            if read_start &lt;= frag_start &lt;= read_end or read_start &lt;= frag_end &lt;= read_end:
                gene_found = True
                break
        if gene_found:
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord(&#39;a&#39;) + len(gene_dict[gene_name]) - 1)
            fields.append(f&quot;{gene_name}{gene_count}&quot;)
        else:
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f&quot;{gene_name}&quot;)
        f_out.write(&quot;\t&quot;.join(fields) + &quot;\n&quot;)

I am not getting the desired output. The output I got

contig  230  3888 read1  1 8397 Gene1 Gene1
contig  3343 3885 read2  1 8397 Gene1 Gene1
contig  6030 8257 read4  1 8397 Gene1 Gene1
contig  6030 8257 read10 1 8397 Gene1 Gene1
contig  6260 8257 read7  1 8397 Gene1 Gene1
contig  6030 8257 read8  1 8397 Gene1 Gene1
contig  6190 6345 read10 1 8397 Gene1 Gene1
contig   0   3124 read27 26 406 Gene12 Gene12
contig  100  200  read30 26 406 Gene12 Gene12

The file is tab-delimited although here I haven't constructed the example tab.
Please provide pointers on where I am going wrong.

答案1

得分: 1

I'm here to assist with the Chinese translation of the provided content. Please let me know which parts you'd like translated.

英文:

The below code has the desired output. I've taken the input as a multiline string and output via prints for ease on my end, so some adapting required.

I've tried to comment the steps, but do ask if not clear. I've used range() and set() logic to find overlaps.

data=&#39;&#39;&#39;contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12 &#39;&#39;&#39;

#initialise variables
last_gene=&#39;&#39;
gene_span=range(0)
total_read_span=range(0)

for contig in data.split(&#39;\n&#39;): #iterate over contigs
	fields=contig.split() #split lines into separate fields. Will need to be .split(&#39;\t) for tab-separated data
	read_span=range(int(fields[1]),int(fields[2])+1)
	if (gene := fields[6]) != last_gene: #check if new gene or previous gene
		last_gene=gene
		alphabet=iter(&#39;abcdefghijklmnopqrstuvqxyz&#39;) #set up an alphabet iterator to assign gene suffixes
		gene_span=range(int(fields[4]),int(fields[5])+1)
		total_read_span=range(0) #initialise total_read_span, i.e. the range of the sequence read
	if len(set(read_span)&amp;set(gene_span)): #check if read or gene spans overlap (the length of a set of intersecting values &gt; 0)
		if len(set(read_span)-set(total_read_span)): #check if the contig read has values outside of current total read span
			print(f&#39;{contig} {gene}{next(alphabet)}&#39;)
			total_read_span=range(min(total_read_span.start,read_span.start),max(total_read_span.stop,read_span.stop)) #update total read span
		else:
			print(f&#39;{contig} covered&#39;)
	else:
		print(f&#39;{fields[3]} has no overlap with {gene}&#39;)

huangapple
  • 本文由 发表于 2023年5月10日 21:42:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76219172.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定