问题

我有一些基因被不同的reads覆盖。一个问题是在某些情况下，一个基因可能被多个reads覆盖。我希望捕获所有符合基因长度范围（第6列（基因起始）和第7列（基因终止））的reads片段，并添加一个带有基因名字、字母“a”以及连续发现的情况下我想要的基因名字与b的结束字段。

输入：
```none
contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12

期望输出：

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

代码：

打开输入文件，用“r”方式 as f_in，打开输出文件，用“w”方式 as f_out：
    gene_dict = {}
    对于f_in中的每一行：
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        如果基因名不在gene_dict中：
            gene_dict[gene_name] = []
        找到基因 = False
        对于gene_dict[gene_name]中的每个基因片段：
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            如果read_start <= frag_start <= read_end 或者 read_start <= frag_end <= read_end：
                找到基因 = True
                退出循环
        如果找到基因：
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord('a') + len(gene_dict[gene_name]) - 1)
            fields.append(f"{gene_name}{gene_count}")
        否则：
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f"{gene_name}")
        f_out.write("\t".join(fields) + "\n")

英文:

I have a few genes that have been covered by different reads. One issue is that in certain cases where one gene is covered by multiple reads. I wish to catch all the fragments of reads that fit with gene length (range) [column 6 (genestart) and column 7 genestop] and add an end field stating the gene name along with alphabet "a" and for successive such finds I want the gene name with b added to it.

Input:

contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12

Desired output:

contig  230  3888 read1  1 8397 Gene1 Gene1a
contig  3343 3885 read2  1 8397 Gene1 covered
contig  6030 8257 read4  1 8397 Gene1 Gene1b
contig  6030 8257 read10 1 8397 Gene1 covered
contig  6260 8257 read7  1 8397 Gene1 covered
contig  6030 8257 read8  1 8397 Gene1 covered
contig  6190 6345 read10 1 8397 Gene1 covered
contig   0   3124 read27 26 406 Gene12 Gene12a
contig  100  200  read30 26 406 Gene12 covered

Code:

with open(input_file, &quot;r&quot;) as f_in, open(output_file, &quot;w&quot;) as f_out:
    gene_dict = {}
    for line in f_in:
        fields = line.strip().split()
        gene_name = fields[6]
        read_start = int(fields[1])
        read_end = int(fields[2])
        gene_start = int(fields[4])
        gene_end = int(fields[5])
        if gene_name not in gene_dict:
            gene_dict[gene_name] = []
        gene_found = False
        for gene_frag in gene_dict[gene_name]:
            frag_start = gene_frag[0]
            frag_end = gene_frag[1]
            if read_start &lt;= frag_start &lt;= read_end or read_start &lt;= frag_end &lt;= read_end:
                gene_found = True
                break
        if gene_found:
            gene_dict[gene_name].append((gene_start, gene_end))
            gene_count = chr(ord(&#39;a&#39;) + len(gene_dict[gene_name]) - 1)
            fields.append(f&quot;{gene_name}{gene_count}&quot;)
        else:
            gene_dict[gene_name] = [(gene_start, gene_end)]
            fields.append(f&quot;{gene_name}&quot;)
        f_out.write(&quot;\t&quot;.join(fields) + &quot;\n&quot;)

I am not getting the desired output. The output I got

contig  230  3888 read1  1 8397 Gene1 Gene1
contig  3343 3885 read2  1 8397 Gene1 Gene1
contig  6030 8257 read4  1 8397 Gene1 Gene1
contig  6030 8257 read10 1 8397 Gene1 Gene1
contig  6260 8257 read7  1 8397 Gene1 Gene1
contig  6030 8257 read8  1 8397 Gene1 Gene1
contig  6190 6345 read10 1 8397 Gene1 Gene1
contig   0   3124 read27 26 406 Gene12 Gene12
contig  100  200  read30 26 406 Gene12 Gene12

The file is tab-delimited although here I haven't constructed the example tab.
Please provide pointers on where I am going wrong.

答案1

得分: 1

I'm here to assist with the Chinese translation of the provided content. Please let me know which parts you'd like translated.

英文:

The below code has the desired output. I've taken the input as a multiline string and output via prints for ease on my end, so some adapting required.

I've tried to comment the steps, but do ask if not clear. I've used range() and set() logic to find overlaps.

data=&#39;&#39;&#39;contig  230  3888 read1  1 8397 Gene1
contig  3343 3885 read2  1 8397 Gene1
contig  6030 8257 read4  1 8397 Gene1
contig  6030 8257 read10 1 8397 Gene1
contig  6260 8257 read7  1 8397 Gene1
contig  6030 8257 read8  1 8397 Gene1
contig  6190 6345 read10 1 8397 Gene1
contig   0   3124 read27 26 406 Gene12
contig  100  200  read30 26 406 Gene12 &#39;&#39;&#39;

#initialise variables
last_gene=&#39;&#39;
gene_span=range(0)
total_read_span=range(0)

for contig in data.split(&#39;\n&#39;): #iterate over contigs
	fields=contig.split() #split lines into separate fields. Will need to be .split(&#39;\t) for tab-separated data
	read_span=range(int(fields[1]),int(fields[2])+1)
	if (gene := fields[6]) != last_gene: #check if new gene or previous gene
		last_gene=gene
		alphabet=iter(&#39;abcdefghijklmnopqrstuvqxyz&#39;) #set up an alphabet iterator to assign gene suffixes
		gene_span=range(int(fields[4]),int(fields[5])+1)
		total_read_span=range(0) #initialise total_read_span, i.e. the range of the sequence read
	if len(set(read_span)&amp;set(gene_span)): #check if read or gene spans overlap (the length of a set of intersecting values &gt; 0)
		if len(set(read_span)-set(total_read_span)): #check if the contig read has values outside of current total read span
			print(f&#39;{contig} {gene}{next(alphabet)}&#39;)
			total_read_span=range(min(total_read_span.start,read_span.start),max(total_read_span.stop,read_span.stop)) #update total read span
		else:
			print(f&#39;{contig} covered&#39;)
	else:
		print(f&#39;{fields[3]} has no overlap with {gene}&#39;)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

有没有Python的方法来匹配范围内的片段

问题

答案1

在Python中如何合并字节文件

如何在CircuitPython中对1到X的整数列表进行无重复随机化？

`brokenaxes` AttributeError: ‘SubplotSpec’对象没有属性’is_last_row’

如何将数据从我的HTML Flask传递到我的SQLite3数据库（Python）？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论