How to download multiple sequences in one fasta file from UniProt using Python 3

Question

I wrote a Python script that downloads protein sequences from UniProt in FASTA format. The script reads accession numbers from a text file (one per line) and downloads the corresponding sequence from the UniProt database. Here is the script:

import requests

with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()
count = 0
for line in lines:
    count += 1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'

    URL = url_part1 + access_id + url_part2

    response = requests.get(URL)

    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)

print("Total sequences downloaded = ", count)

This works fine, but for hundreds of sequences it generates a large number of files. I would rather have each downloaded sequence written to a single file, directly below the previous one. A FASTA file is basically a text file in which each record starts with a header line marked with ">", for example:

>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh

and so on

Answer 1

Score: 0

Something like this? Just write them all to the same file.

import requests

count = 0  # stays 0 if the input file is empty
with open('testfasta.txt', 'r') as infile, \
     open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that fetch succeeded; raise error if not
        response.raise_for_status()
        # sanity checks: header present, record ends with a newline
        assert response.text.startswith('>')
        assert response.text.endswith('\n')
        outfile.write(response.text)

print(f"Total sequences downloaded = {count}")

This assumes that each response you fetch is newline-terminated and starts with the FASTA header. If that is not always true, replace the asserts with code that fixes those cases. I also made a few changes to make the script more idiomatic.
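If you would rather repair a response than abort on a failed assert, something along these lines might work. It is only a sketch: normalize_fasta_record is a hypothetical helper name, and the choice to label a headerless body with the accession is an assumption, not something UniProt specifies. The sample strings stand in for response.text.

```python
def normalize_fasta_record(text, fallback_id):
    """Return text as one FASTA record: starts with '>' and ends with one newline.

    fallback_id is only used to build a header when the text has none.
    """
    record = text.strip()
    if not record:
        raise ValueError(f"empty response for {fallback_id}")
    if not record.startswith(">"):
        # Assumption: a body without a header is a bare sequence,
        # so label it with the accession that was requested.
        record = f">{fallback_id}\n{record}"
    return record + "\n"


# Stand-ins for response.text; no network access needed to try this.
records = [
    normalize_fasta_record(">seq1 demo\nMKTAYIAK", "seq1"),  # no final newline
    normalize_fasta_record("MKVLAAGI\n\n", "seq2"),          # no header, extra blank line
]
print("".join(records), end="")
```

In the loop above you would then call outfile.write(normalize_fasta_record(response.text, access_id)) in place of the two asserts and the plain write.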

A minor complication is that the response.content you download is bytes, not text. You could decode it yourself, but Requests already does this for you and exposes the result as response.text.
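If you prefer to keep working with bytes, as the original script does, the output file just has to be opened in binary mode. A rough sketch, where write_raw_records is a hypothetical helper and the byte strings stand in for response.content from successful requests:

```python
import os
import tempfile


def write_raw_records(chunks, path):
    """Write each bytes chunk to path in order, without decoding anything."""
    with open(path, "wb") as outfile:
        for chunk in chunks:
            outfile.write(chunk)
            if not chunk.endswith(b"\n"):
                # Keep the next header on its own line.
                outfile.write(b"\n")


# Stand-ins for response.content; the second one lacks a final newline.
fake_contents = [b">seq1 demo\nMKTAYIAK\n", b">seq2 demo\nMKVLAAGI"]
out_path = os.path.join(tempfile.mkdtemp(), "results.fasta")
write_raw_records(fake_contents, out_path)

with open(out_path, "rb") as handle:
    combined = handle.read()
print(combined.decode("utf-8"), end="")
```

Either way works for FASTA, which is plain ASCII in practice; the text route is simply less to type.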

huangapple
  • Published on July 17, 2023, 13:13:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76923267.html