My open reading frame (ORF) finding code is not finding the longest ORF in the sequence

Question

I am trying to write a function that finds the longest open reading frame (ORF). However, for one particular sequence it does not locate the longest ORF, and I cannot figure out why.

This is the sequence:

> GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
> GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
> TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
> TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
> CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
> AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
> ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
> CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
> GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
> CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
> CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
> CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
> CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
> GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
> CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
> ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
> ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
> GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
> CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
> CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT

My code says the longest ORF is 228 nucleotides long and located between nucleotides 655 and 882. However, the longest ORF is actually 327 nucleotides long and located between nucleotides 575 and 901.

This is my code. I made sure it does not stop at stop codons that are outside the reading frame, but without success. Can anyone figure out why it doesn't work? I save the sequence as a FASTA file, then open it and read the sequence before calling the function.

import re
from Bio.Seq import Seq

def find_orfs(sequence):

    # Scan through the sequence to find open reading frames
    longest_orf = ""
    strand = ""
    longest_orf_start = -1
    longest_orf_end = -1
    for i in range(3):
        # Search forward frames
        orfs =re.findall(r'(?s)ATG(?:...)*?(?:TAA|TAG|TGA)', sequence[i:])
        for orf in orfs:
            # Find the longest ORF within the sequence

            if len(orf) > len(longest_orf):
                strand = "+"
                longest_orf = orf
                longest_orf_start = i + sequence.index(orf) +1
                longest_orf_end = i + sequence.index(orf) + len(orf)
        # Search reverse frames
        seq_rev = str(Seq(sequence).reverse_complement())
        orfs = re.findall(r'(?s)ATG(?:...)*?(?:TAA|TAG|TGA)', seq_rev[i:])
        for orf in orfs:

            # Find the longest ORF within the sequence
            if len(orf) > len(longest_orf):
                longest_orf = orf
                strand = "-"
                longest_orf_start = len(sequence) - i - seq_rev.index(orf) - len(orf) +1
                longest_orf_end = len(sequence) - i - seq_rev.index(orf)
    print("Longest ORF:", longest_orf)
    print("Strand:", strand)
    print("Start position:", longest_orf_start)
    print("End position:", longest_orf_end)
    # Reverse complement the original DNA sequence
    seq_rev_comp = str(Seq(sequence).reverse_complement())
    # Translate the longest ORF to a protein sequence
    protein_seq = Seq(longest_orf).translate()
    print("Protein sequence:", protein_seq)
    protein_seq = str(protein_seq)

    return(longest_orf, protein_seq)

Answer 1

Score: 1

The issue with this code is the regular expression:

(?s)ATG(?:...)*?(?:TAA|TAG|TGA)

You can try out the regex with your sequence in an online regex tester. The regular expression does not allow overlapping matches: once it finds a hit, it consumes those characters, so a match that starts before the correct open reading frame can end inside it and split it up.

In this sequence, the second match eats into the correct ORF, and the next ORF the regex finds is the wrong answer.
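
The consuming behaviour can be reproduced on a much smaller input. The following is a minimal sketch, assuming a hypothetical toy sequence (not taken from the question) in which a short ORF at index 0 overlaps the start codon of a longer ORF at index 5:

```python
import re

# The consuming pattern from the question, without the (?s) flag, which
# has no effect on a sequence that contains no newlines.
pattern = re.compile(r'ATG(?:...)*?(?:TAA|TAG|TGA)')

# Hypothetical toy sequence: a 6-nt ORF at index 0 overlaps a 15-nt ORF
# whose ATG starts at index 5.
toy = "ATGTAATGCCCGGGAAATGA"

# findall scans left to right and resumes after the end of each match,
# so the ATG at index 5 is never tried as a start codon.
found = pattern.findall(toy)
print(found)  # ['ATGTAA']

# Anchoring the same pattern at index 5 shows that the longer ORF exists.
skipped = pattern.match(toy, 5).group(0)
print(skipped)  # ATGCCCGGGAAATGA
```

The first match ends at index 6, one base past the `A` that begins the longer ORF, which is the same kind of overlap that hides the 327-nt ORF in the full sequence.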

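To fix this, we can use a positive lookahead, (?=...):

(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))

The regex is close to the original, except that (?s) is removed and the whole pattern is wrapped in (?=( ... )). A lookahead consumes no characters, so overlapping matches are allowed and we can get the answer we're looking for. Because the stop codon sits inside the capture group, each ORF returned by re.findall includes its stop codon. Both assignments to the orfs variable should now look like this, with sequence[i:] for the forward search and seq_rev[i:] for the reverse search:

orfs = re.findall(r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))', sequence[i:])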

With that change, I get the following output:

Longest ORF: ATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGA
Strand: +
Start position: 575
End position: 901
Protein sequence: MTTIRDSFNGSYQPNFDHWTADRHGDLCRTWTRCWRFSIGSAPLTSITNKSEVAKPDRTIKIPGVSPWKRSPVPTLPLTGYLSAFLPSGFLNAHAVGISVRCRSFAPS*
('ATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGA', 'MTTIRDSFNGSYQPNFDHWTADRHGDLCRTWTRCWRFSIGSAPLTSITNKSEVAKPDRTIKIPGVSPWKRSPVPTLPLTGYLSAFLPSGFLNAHAVGISVRCRSFAPS*')
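
One possible way to fold the lookahead into a complete function is sketched below. It is an illustration under stated assumptions, not the code that produced the output above: the names `ORF_PATTERN`, `reverse_complement` and `longest_orf` are hypothetical, the Biopython reverse complement is swapped for a plain `str.translate` so the block has no dependencies, and positions are taken from the match object instead of `sequence.index(orf)`, which returns the first occurrence of the text and not necessarily the matched one. Since a lookahead is tried at every position, the loop over the three frame offsets is not needed here.

```python
import re

# Lookahead with the stop codon inside the capture group: every ATG is
# tried as a start, and each reported ORF includes its stop codon.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?(?:TAA|TAG|TGA)))')
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(sequence):
    """Return (orf, strand, start, end); coordinates are 1-based, inclusive,
    and always refer to the forward strand."""
    n = len(sequence)
    best = ("", "", -1, -1)
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for m in ORF_PATTERN.finditer(seq):
            orf = m.group(1)
            if len(orf) > len(best[0]):
                if strand == "+":
                    start, end = m.start(1) + 1, m.end(1)
                else:
                    # Map reverse-strand offsets back to the forward strand.
                    start, end = n - m.end(1) + 1, n - m.start(1)
                best = (orf, strand, start, end)
    return best


# Hypothetical toy input: the 15-nt ORF starts at index 5 (position 6).
print(longest_orf("ATGTAATGCCCGGGAAATGA"))
# ('ATGCCCGGGAAATGA', '+', 6, 20)
```

On ties this sketch keeps the first ORF it encounters, with the forward strand searched first.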

Answer 2

Score: 0

<details>
<summary>English:</summary>
My code is similar to yours, but uses Python's built-in max() function.
It is copied from [bioinformatics.stackexchange.com: Find open reading frames in a DNA sequence][1]

import re

from Bio.Seq import Seq

sequence = ('GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCGGTGGCGAAA'
'CCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG'
'TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGAAGCGTGGC'
'TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG'
'CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA'
'AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG'
'ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT'
'CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT'
'GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG'
'CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA'
'CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG'
'CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA'
'CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA'
'GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG'
'CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG'
'ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA'
'ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC'
'GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG'
'CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG'
'CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT')

print(sequence,'\n\n')

def find_orfs(sequence):
    # Both the ORF body and the stop codon are inside lookaheads, so matches
    # may overlap and group 1 holds the ORF without its stop codon.
    pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

    # Translation table used to complement the reversed sequence
    revcompseq = str.maketrans("ATGC", "TACG")
    # print(pattern.findall(sequence))  # forward search
    # print(pattern.findall(sequence[::-1].translate(revcompseq)))  # backward search
    b = [(m.span(1), m.start(1), m.end(1), (m.end(1) - m.start(1)), m, m.groups()[0], 'forward') for m in re.finditer(pattern, sequence)]
    b_rev = [(m.span(1), m.start(1), m.end(1), (m.end(1) - m.start(1)), m, m.groups()[0], 'reverse') for m in re.finditer(pattern, sequence[::-1].translate(revcompseq))]
    b = max(b, key=lambda x: x[3])
    try:
        b_rev = max(b_rev, key=lambda x: x[3])
    except ValueError:
        # max() raises ValueError when no reverse-strand ORF was found
        b_rev = (0, 0, 0, 0, 0, 0)
    b_max = max((b, b_rev), key=lambda x: x[3])
    # print(b, '\n', b_rev, '\n')
    print(b_max, type(b), b[4], len(b[5]))
    protein_seq = Seq(b_max[5]).translate()
    print("Protein sequence:", protein_seq)
    protein_seq = str(protein_seq)
    return (b_max[5], protein_seq)

print('\n\n', find_orfs(sequence))


Output:

((573, 897), 573, 897, 324, <re.Match object; span=(573, 573), match=''>, 'ATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGC', 'forward') <class 'tuple'> <re.Match object; span=(573, 573), match=''> 324
Protein sequence: MTTIRDSFNGSYQPNFDHWTADRHGDLCRTWTRCWRFSIGSAPLTSITNKSEVAKPDRTIKIPGVSPWKRSPVPTLPLTGYLSAFLPSGFLNAHAVGISVRCRSFAPS

('ATGACGACCATCAGGGACAGCTTCAACGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCGCACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGCTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGC', 'MTTIRDSFNGSYQPNFDHWTADRHGDLCRTWTRCWRFSIGSAPLTSITNKSEVAKPDRTIKIPGVSPWKRSPVPTLPLTGYLSAFLPSGFLNAHAVGISVRCRSFAPS')


I am not sure whether using `re.Match` object attributes would be faster than your algorithm for big sequences. My code also has a glitch: it assumes that there is only one ORF of the greatest length, as I believe your algorithm does too.
If used with:

sequence = ('ATGAAAAAAAAAAAAAAAAATGTAG'
'ATGAAAAAAAAAAAAAAAAAATAG')


it returns the first of the two ORFs of the same length in the sequence.
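
If every ORF tied for the greatest length is wanted, one option is to compute the maximum length first and then keep all matches of that length. This is a minimal sketch, assuming the same pattern as above (so the stop codon is excluded from each ORF); the function name `all_longest_orfs` and the toy sequence are hypothetical, and only the forward strand is searched:

```python
import re

# Same lookahead pattern as in the answer: group 1 is the ORF without
# its stop codon.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')


def all_longest_orfs(sequence):
    """Return a list of (0-based start, orf) for every forward-strand ORF
    that shares the maximum length."""
    matches = [(m.start(1), m.group(1)) for m in ORF_PATTERN.finditer(sequence)]
    if not matches:
        return []
    top = max(len(orf) for _, orf in matches)
    return [(start, orf) for start, orf in matches if len(orf) == top]


# Hypothetical toy sequence with two 6-nt ORFs of equal length.
print(all_longest_orfs("ATGAAATAGCATGCCCTAA"))
# [(0, 'ATGAAA'), (10, 'ATGCCC')]
```

Returning a list leaves it to the caller to decide how ties should be handled.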
[1]: https://bioinformatics.stackexchange.com/questions/20442/find-open-reading-frames-in-a-dna-sequence/20452#20452
</details>

huangapple
  • Published on March 9, 2023, 22:46:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75686169.html