July 17, 2023, 13:13:05

How to download multiple sequences in one fasta file from UniProt using Python 3

Question

I wrote a Python script that downloads protein sequences from UniProt in FASTA format. It reads accession numbers from a text file (one per line) and downloads the corresponding sequence for each one from the UniProt database. Here is the script:

import requests

with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()
count = 0
for line in lines:
    count += 1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2

    response = requests.get(URL)

    # One output file per accession number
    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)
print("Total sequences downloaded =", count)

This works fine, but for hundreds of sequences it generates a large number of files. It would be better to write each incoming sequence below the previous one in a single file. A FASTA file is essentially a text file in which each record starts with a header line marked with ">", followed by the sequence lines, e.g.

>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh

and so on

Answer 1

Score: 0

Something like this? Just write them all to the same file.

import requests

# Open the accession list and the single combined output file together.
# The backslash continues the with statement onto the next line.
with open('testfasta.txt', 'r') as infile, \
     open('results.fasta', 'w') as outfile:
    count = 0  # keeps the final print valid even if the input file is empty
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that fetch succeeded; raise error if not
        response.raise_for_status()
        assert response.text.startswith('>')
        assert response.text.endswith('\n')
        outfile.write(response.text)
print(f"Total sequences downloaded = {count}")

This assumes that the data you fetch is newline-terminated and includes the FASTA header before the sequence itself. If that is not always true, replace the asserts with code that fixes such problems. I also made various changes to make the code more idiomatic.
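
The asserts above could be replaced with a small clean-up step. The following is a minimal sketch, not part of the original answer: normalize_fasta_record is a hypothetical helper name, and it assumes that stray surrounding whitespace and a missing final newline are the only problems worth repairing, while a response without a ">" header is treated as an error.

```python
def normalize_fasta_record(text: str) -> str:
    """Return text as a FASTA record that starts with '>' and ends with one newline.

    Hypothetical helper: trims surrounding whitespace, then checks the header.
    """
    record = text.strip()
    if not record.startswith('>'):
        # Nothing sensible can be repaired without a header line
        raise ValueError('response does not look like FASTA (no ">" header)')
    return record + '\n'


# A record missing its final newline gets exactly one appended
print(repr(normalize_fasta_record('>firstseq_header\nACDEFGHIK')))
```

In the download loop, outfile.write(normalize_fasta_record(response.text)) would then take the place of the two asserts and the plain write.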

A minor complication is that the response.content you download is bytes, not text. You could decode it yourself, but Requests already does this for you and provides the result in response.text.
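
To illustrate the bytes versus text distinction without touching the network, here is a small sketch. The raw value is a made-up stand-in for response.content, and UTF-8 is an assumption (FASTA data is normally plain ASCII). Requests chooses the encoding for response.text itself, so this only approximates what it does.

```python
import os
import tempfile

# Stand-in for response.content: a made-up FASTA record as raw bytes
# (hypothetical data, not a real UniProt entry).
raw = b">example_header\nMKTAYIAKQR\n"

# Decoding gives the str that a text-mode file expects (UTF-8 assumed).
text = raw.decode("utf-8")

with tempfile.TemporaryDirectory() as tmpdir:
    bytes_path = os.path.join(tmpdir, "bytes.fasta")
    text_path = os.path.join(tmpdir, "text.fasta")

    # Binary append mode accepts bytes and writes them unchanged
    with open(bytes_path, "ab") as out:
        out.write(raw)

    # Text append mode accepts str; newline="" disables newline translation
    with open(text_path, "a", encoding="utf-8", newline="") as out:
        out.write(text)

    # Both files end up with identical contents on disk
    with open(bytes_path, "rb") as f1, open(text_path, "rb") as f2:
        same = f1.read() == f2.read()

print(same)
```

Either route works for a combined file, as long as the file mode matches the type being written: "wb" or "ab" with response.content, "w" or "a" with response.text.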
