July 27, 2023 15:49:59

Errors in removing duplicate sequences fasta file - problem in the header

Question

I am trying to combine some protein sequences in FASTA format and then remove duplicates. I found this code by searching and it works well enough, but I ran into an issue that I could not understand. Here are the example sequences that cause the error:

>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone – in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR

The original file and sequences are long, so I shortened them for readability.

I found this code on the forum; it works fine and writes a new file without duplicates:

from Bio import SeqIO
import time

start = time.time()

seen = []
records = []

for record in SeqIO.parse("Prob2.fa", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)


#writing to a fasta file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60}")

Now, the Python interpreter is giving me this error:

Traceback (most recent call last):
  File "C:\Users\Arif\Desktop\DuplicateSequenceFinder\DuplicateFinder.py", line 10, in <module>
    for record in SeqIO.parse("Prob2.fa", "fasta"):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 72, in __next__
    return next(self.records)
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 449: illegal multibyte sequence

I found that the problem is the dash character in one header, written as "– in claim" (the "thirdone" sequence in the list). If I remove it, the code works fine, even though other sequence headers also contain a "-". I found it by removing half of the sequences at a time and checking whether the error persisted. If I delete this dash and type a new "-", the code works fine. I am trying to understand what the real problem is, so that I can use the correct input format in the future.

I originally wrote these sequences in Word, later edited them in Notepad++, and saved them as a ".fa" file.

Secondly, I want to find out how many duplicates were found and report their record IDs/headers. If someone can tell me which lines of code to insert, I would be very grateful.

Answer 1

Score: 0

Here is my attempt. I cannot reproduce your error using the same input:

>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone - in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR

Try the following code:

from Bio import SeqIO
import time

start = time.time()

seen = set()   # sequences already encountered (a set gives fast membership tests)
records = []   # first record found for each distinct sequence

filename = 'Prob2.fa'

# Open the file with an explicit encoding instead of relying on the platform default
with open(filename, 'r', encoding='utf-8') as f:

    for record in SeqIO.parse(f, "fasta"):
        sequence = str(record.seq)
        if sequence not in seen:
            seen.add(sequence)
            records.append(record)


# writing to a fasta file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()

print(f"Run time is {(end - start)/60}")

Let us know if it works.

I can reproduce your error by using this line in my code:

with open(filename, 'r', encoding='gbk') as f:

and adding the character 丆 to one of your headers.

However, I no longer get the error if I delete the 丆 from the FASTA header.
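To locate characters like this without bisecting the file by hand, one option is to scan the header lines for anything outside ASCII. The following is only a sketch: `find_non_ascii` is a hypothetical helper name, and the sample text is a shortened stand-in for the real file.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line number, column, character, Unicode name) for every
    non-ASCII character found in FASTA header lines (lines starting with '>')."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.startswith(">"):
            continue  # only headers are checked; sequence lines are skipped
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Shortened stand-in for the file in the question; \u2013 is an en dash.
sample = ">thirdseq\nMISQSF\n>thirdone \u2013 in claim\nMISTSK\n"

for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, the text could be read with `open(filename, encoding='utf-8')` first, assuming the file was saved as UTF-8.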

As Poshi pointed out:

> This looks like an encoding issue. Not sure why the data is being decoded with the GBK decoder.
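A likely explanation, offered as an assumption rather than something confirmed in this thread: the header in the question contains an en dash (U+2013), which Word often substitutes for a typed hyphen, and the file was saved as UTF-8. When a file is opened in text mode without an explicit encoding, Python falls back to the locale's preferred encoding, which on a Chinese-locale Windows system is typically GBK. The sketch below uses only the standard library to show that the UTF-8 bytes of an en dash fail to decode as GBK; `try_decode` is a hypothetical helper.

```python
# Header line from the question, with the en dash (U+2013) as it appears there.
header = ">thirdone \u2013 in claim\n"
raw = header.encode("utf-8")  # the en dash becomes the three bytes E2 80 93


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return err


for encoding in ("utf-8", "gbk"):
    result = try_decode(raw, encoding)
    if isinstance(result, UnicodeDecodeError):
        bad = raw[result.start]
        print(f"{encoding}: failed on byte 0x{bad:02x} at position {result.start}")
    else:
        print(f"{encoding}: ok -> {result!r}")
```

A plain ASCII hyphen is a single byte that decodes identically under both encodings, which would explain why retyping the dash made the error disappear.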

See https://github.com/biopython/biopython/blob/master/Bio/SeqIO/__init__.py#L559 for an explanation of how to feed data to SeqIO.parse(...):

> Arguments:
> - handle - handle to the file, or the filename as a string
>
> .......
>
> If you have a string 'data' containing the file contents, you must first turn this into a handle in order to parse it:

As Poshi said, this should not be a Biopython issue. Try running just this:

filename = 'Prob2.fa'

with open(filename, 'r', encoding='utf-8') as f: #or encoding='gbk'

    print(f.read())

on the same file and see if you get the same error.
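The question also asks how to count the duplicates and report their headers. One possible approach is to remember the header of the first record seen for each sequence and collect the headers of later records with the same sequence. The sketch below is standard-library only so it runs on its own: `read_fasta` and `split_duplicates` are hypothetical helpers, not Biopython functions, and the sample sequences are shortened.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle (hypothetical helper)."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_duplicates(pairs):
    """Return (unique, duplicates).

    unique: list of (header, sequence), the first occurrence of each sequence.
    duplicates: dict mapping a kept header to the headers dropped as its duplicates.
    """
    first_header = {}               # sequence -> header of its first occurrence
    unique = []
    duplicates = defaultdict(list)
    for header, seq in pairs:
        if seq in first_header:
            duplicates[first_header[seq]].append(header)
        else:
            first_header[seq] = header
            unique.append((header, seq))
    return unique, dict(duplicates)


# Shortened stand-in for the input file; the first two sequences are identical.
sample = StringIO(
    ">someseq1\nMKYFPLFPTLVF\n"
    ">firstseq with 5 mutations:\nMKYFPLFPTLVF\n"
    ">secondseq with 9 mutations:\nMKYFPLFPTLVY\n"
)

unique, duplicates = split_duplicates(read_fasta(sample))
total = sum(len(dropped) for dropped in duplicates.values())
print(f"{total} duplicate(s) found")
for kept, dropped in duplicates.items():
    print(f"kept {kept!r}; dropped {dropped}")
```

With Biopython, the same `split_duplicates` function could presumably be fed pairs of `(record.description, str(record.seq))` built from the records returned by `SeqIO.parse`, instead of using `read_fasta`.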
