Errors when removing duplicate sequences from a FASTA file: problem in the header
Question
I am trying to combine some protein sequences in FASTA format and then remove the duplicates. I found this code by searching and it works well enough, but I ran into an issue that I don't understand. Here are the example sequences that cause the error:
>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone – in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR
The original file and sequences are long, so I shortened them for readability.
I found this code on the forum. It works fine and writes a new file without duplicates:
from Bio import SeqIO
import time
start = time.time()
seen = []
records = []
for record in SeqIO.parse("Prob2.fa", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)
# Write the unique records to a FASTA file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()
print(f"Run time is {(end - start)/60}")
Now, the Python interpreter is giving me this error:
Traceback (most recent call last):
  File "C:\Users\Arif\Desktop\DuplicateSequenceFinder\DuplicateFinder.py", line 10, in <module>
    for record in SeqIO.parse("Prob2.fa", "fasta"):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 72, in __next__
    return next(self.records)
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 449: illegal multibyte sequence
I found that the problem is in the header containing the "-" character, written as "- in claim" (the third sequence in the list). If I remove that character it works fine, even though other sequence headers contain "-" as well. I found it by removing half of the sequences at a time and checking whether the error still occurred. If I delete this "-" and type a new "-", it works fine. So I am trying to understand what the real problem is, so that I can use the correct input format in the future.
I originally wrote these sequences in Word, later edited them in Notepad++, and saved the result as a ".fa" file.
Secondly, I want to find out how many duplicates were found and report their record IDs/headers. If someone can tell me which lines of code I should add, I would be very grateful.
Answer 1
Score: 0
OK, here is my attempt. I cannot reproduce your error, but using your same input:
>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone - in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR
try the following code:
from Bio import SeqIO
import time
start = time.time()
seen = []
records = []
filename = 'Prob2.fa'
# Open the file with an explicit encoding instead of the platform default
with open(filename, 'r', encoding='utf-8') as f:
    for record in SeqIO.parse(f, "fasta"):
        if str(record.seq) not in seen:
            seen.append(str(record.seq))
            records.append(record)
# Write the unique records to a FASTA file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()
print(f"Run time is {(end - start)/60}")
Let us know whether it works.
I can reproduce your error by using this line in my code:
with open(filename, 'r', encoding='gbk') as f:
and adding the character 丆 to one of your headers. If I delete the 丆 from the FASTA header, the error goes away.
As Poshi pointed out:
> This looks like an encoding issue. Not sure why the data is being decoded with the GBK decoder.
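A likely explanation, offered as a sketch rather than a confirmed diagnosis: the character in the failing header is an en dash (U+2013), which Word commonly substitutes for a typed hyphen, and the file was saved as UTF-8. When `open()` is called without an `encoding` argument, Python falls back to the locale's preferred encoding, which is GBK (cp936) on a Chinese-locale Windows system. The snippet below assumes that scenario and uses only the standard library; the header string is copied from the question.

```python
import locale

# U+2013 is the en dash; an ASCII hyphen is U+002D.
header = ">thirdone \u2013 in claim\n"

# In UTF-8 the en dash becomes the three bytes E2 80 93.
raw = header.encode("utf-8")
print(raw)

# Decoding those bytes as GBK fails: E2 80 is read as one two-byte character,
# and 0x93 followed by a space is not a valid GBK sequence.
failure = None
try:
    raw.decode("gbk")
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# Decoding with the encoding the file was actually saved in works.
print(raw.decode("utf-8"))

# This is the encoding open() uses when no encoding argument is given.
print(locale.getpreferredencoding(False))
```

This also matches the byte 0x93 reported in the traceback, and it would explain why retyping the character as a plain "-" makes the error disappear.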
See https://github.com/biopython/biopython/blob/master/Bio/SeqIO/__init__.py#L559 for an explanation of how to feed data to SeqIO.parse(...):
> Arguments:
> - handle - handle to the file, or the filename as a string
> ...
> If you have a string 'data' containing the file contents, you must first turn this into a handle in order to parse it:
As Poshi said, this should not be a Biopython issue. Try reading the same file with just:
filename = 'Prob2.fa'
with open(filename, 'r', encoding='utf-8') as f:  # or encoding='gbk'
    print(f.read())
and see whether you get the same error.
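The second part of the question (how many duplicates were found, and which headers they belong to) is not covered above. One possible approach is sketched here in plain Python so that it runs without any input file. `find_duplicates` is a hypothetical helper name, and the sample data is the first three records from the question.

```python
def find_duplicates(pairs):
    """Split (title, sequence) pairs into first occurrences and duplicates.

    Returns a dict mapping each distinct sequence to the title of its first
    occurrence, and a list of (duplicate_title, first_title) tuples.
    """
    first_seen = {}
    duplicates = []
    for title, sequence in pairs:
        if sequence in first_seen:
            # Same sequence already seen: record which header it repeats.
            duplicates.append((title, first_seen[sequence]))
        else:
            first_seen[sequence] = title
    return first_seen, duplicates


sample = [
    ("someseq1",
     "MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
    ("firstseq with 5 mutations:",
     "MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
    ("secondseq with 9 mutations:",
     "MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
]

unique, duplicates = find_duplicates(sample)
print(f"{len(duplicates)} duplicate(s) found")
for dup_title, first_title in duplicates:
    print(f"{dup_title!r} has the same sequence as {first_title!r}")
```

With Biopython, the pairs could come from `SimpleFastaParser` in `Bio.SeqIO.FastaIO` (the function that appears in the traceback), which yields `(title, sequence)` tuples from a handle opened with `encoding='utf-8'`. A dict is used instead of a list because membership checks on a list get slow as the number of sequences grows.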