Errors when removing duplicate sequences from a FASTA file: problem in the header

Question


I am trying to combine some protein sequences in FASTA format and then remove the duplicates. I found the code below by searching and it works well enough, but I ran into an issue that I could not understand. Here are the example sequences that cause the error:

>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone – in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR

The original file and sequences are long, so I shortened them for readability.

I found this code on the forum. It works fine and writes a new file without duplicates:

from Bio import SeqIO
import time

start = time.time()

seen = []
records = []

for record in SeqIO.parse("Prob2.fa", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)


# writing to a fasta file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60}")

Now, the Python interpreter is giving me this error:

Traceback (most recent call last):
  File "C:\Users\Arif\Desktop\DuplicateSequenceFinder\DuplicateFinder.py", line 10, in <module>
    for record in SeqIO.parse("Prob2.fa", "fasta"):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 72, in __next__
    return next(self.records)
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 449: illegal multibyte sequence

I found that the problem is in the header, with the "-" character in "- in claim" (the "thirdone" header). If I remove it, the code works fine, but other sequence headers also contain a "-". I found it by removing half of the sequences and checking whether the error still occurred. If I delete this "-" and type a new "-", the code works fine. So I am trying to understand what the real problem is, so that I can write my input in the correct format in the future.

I originally wrote these sequences in Word, later edited them in Notepad++, and saved the result as a ".fa" file.

Secondly, I want to find out how many duplicates were found and list their record IDs/headers. If someone can tell me which lines of code I should insert, I will be very grateful.

Answer 1

Score: 0


Here is my attempt. At first I could not reproduce your error.
Using the same input as you:

>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone - in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR

Try the following code:

from Bio import SeqIO
import time

start = time.time()

seen = []
records = []

filename = 'Prob2.fa'

# open the file explicitly as UTF-8 instead of relying on the system default
with open(filename, 'r', encoding='utf-8') as f:

    for record in SeqIO.parse(f, "fasta"):
        if str(record.seq) not in seen:
            seen.append(str(record.seq))
            records.append(record)


# writing to a fasta file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60}")

Let us know if it works.
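The question also asked how to count the duplicates and list their headers. The loop above could be extended along the lines of the sketch below. The function name `split_duplicates` and the `FakeRecord` stand-in are made up for illustration; the only assumption about the records is that they have `.seq` and `.description` attributes, as Biopython's `SeqRecord` objects do.

```python
from collections import namedtuple


def split_duplicates(records):
    """Return (unique_records, duplicates).

    duplicates is a list of (duplicate_header, first_header) pairs, where
    first_header belongs to the first record seen with the same sequence.
    """
    first_seen = {}   # sequence string -> header of the first record with it
    unique = []
    duplicates = []
    for record in records:
        key = str(record.seq)
        if key in first_seen:
            duplicates.append((record.description, first_seen[key]))
        else:
            first_seen[key] = record.description
            unique.append(record)
    return unique, duplicates


# Hypothetical stand-in for SeqRecord so the sketch runs without Biopython
FakeRecord = namedtuple("FakeRecord", ["description", "seq"])
demo = [
    FakeRecord("someseq1", "MKYFPLF"),
    FakeRecord("firstseq with 5 mutations:", "MKYFPLF"),
    FakeRecord("secondseq with 9 mutations:", "MKYFPLV"),
]

unique, duplicates = split_duplicates(demo)
print(f"{len(duplicates)} duplicate(s) found")
for dup_header, first_header in duplicates:
    print(f"'{dup_header}' duplicates '{first_header}'")
```

With Biopython, the same function should accept `SeqIO.parse(f, "fasta")` in place of `demo`, and `unique` can then be passed to `SeqIO.write`. Using a dictionary keyed on the sequence also avoids rescanning a list for every record, which may matter for large files.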

I can reproduce your error by opening the file in my code with:

with open(filename, 'r', encoding='gbk') as f:

and adding the offending character to one of your headers. I no longer get the error if I delete that character from the FASTA header.

As Poshi pointed out:

> This looks like an encoding issue. Not sure why the data is being decoded with the GBK decoder.
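One plausible explanation, offered as a guess rather than a diagnosis: the dash in the failing header of the question is an en dash (U+2013), not an ASCII hyphen. Word's autoformatting may have put it there. When no encoding is given, Python opens text files with the system's preferred encoding, which is GBK on a Simplified Chinese Windows installation. If the file was saved as UTF-8, the en dash is stored as the bytes `e2 80 93`, and a GBK decoder fails on the `0x93` byte, which matches the traceback. The sketch below reproduces this on the header string alone, without any file:

```python
# Header from the question, with the en dash written as an explicit escape
header = "thirdone \u2013 in claim"

raw = header.encode("utf-8")      # the bytes a UTF-8 file would contain
print(raw[9:12].hex(" "))         # the three bytes of the en dash

error = None
try:
    raw.decode("gbk")             # what happens when GBK is the default encoding
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# A plain ASCII hyphen is encoded the same way in UTF-8 and GBK, so it round-trips
plain = "thirdone - in claim".encode("utf-8").decode("gbk")
print(plain)
```

This would also explain why retyping the dash in Notepad++ fixes the file: the retyped character is an ASCII hyphen, which every common encoding decodes identically.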

See https://github.com/biopython/biopython/blob/master/Bio/SeqIO/init.py#L559 for an explanation of how to feed data to SeqIO.parse(...):

> Arguments:
> - handle - handle to the file, or the filename as a string
>
> .......
>
> If you have a string 'data' containing the file contents, you must
> first turn this into a handle in order to parse it:

As Poshi said, this should not be a Biopython issue. Try running just:

filename = 'Prob2.fa'

with open(filename, 'r', encoding='utf-8') as f:  # or encoding='gbk'

    print(f.read())

on the same file, and see whether you get the same error.
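To find which headers contain problematic characters without halving the file by hand, a small scan for non-ASCII characters may help. This is a minimal sketch using only the standard library. The helper name `find_non_ascii` is made up for illustration, and the sample text stands in for the real file contents.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_number, column, character, unicode_name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Shortened sample; the en dash is written as an escape so it is visible
sample = ">thirdseq\nMISQSF\n>thirdone \u2013 in claim\nMISTSK\n"
for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, the text could be read with `open(filename, encoding='utf-8').read()`, assuming the file is in fact UTF-8. Each reported character can then be replaced with its ASCII equivalent, or the file can be kept as is and always opened with an explicit encoding.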

Posted by huangapple on July 27, 2023 at 15:49:59
When reposting, please keep the link to this article: https://go.coder-hub.com/76777565.html