How to download multiple sequences in one fasta file from UniProt using Python 3
Question
I wrote a Python script that downloads protein sequences from UniProt in FASTA format. It reads accession numbers from a text file (one per line) and downloads the corresponding sequence for each from the UniProt database. Here is the script:
import requests

with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()

count = 0
for line in lines:
    count += 1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2
    response = requests.get(URL)
    # one output file per accession number
    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)
print("Total sequences downloaded = ", count)
This works fine, but for hundreds of sequences it generates a large number of files. It would be better to write everything to one file, with each incoming sequence appended below the previous one. A FASTA file is basically a text file in which each sequence is preceded by a header line starting with ">".
e.g.
>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh
and so on
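To make the layout above concrete, here is a minimal sketch of how such a combined file can be read back into separate records. The function name parse_fasta is made up for illustration, and the sketch assumes every record starts with a ">" header line, as described above.

```python
def parse_fasta(text):
    """Split multi-record FASTA text into (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # sequence lines are joined into one string per record
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


example = """>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Counting the returned pairs is a quick way to check that a combined file holds as many sequences as there were accession numbers.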
Answer 1
Score: 0
Something like this? Just write them all to the same file.
import requests

# open the accession list and the single combined output file together
with open('testfasta.txt', 'r') as infile, \
        open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that fetch succeeded; raise error if not
        response.raise_for_status()
        # sanity checks: FASTA header first, trailing newline last
        assert response.text.startswith('>')
        assert response.text.endswith('\n')
        outfile.write(response.text)
print(f"Total sequences downloaded = {count}")
This assumes that the data you fetch is newline-terminated and includes the FASTA header before the sequence itself. If that's not necessarily always true, maybe replace the asserts with code to fix any such problems. I also made various changes to make it more idiomatic.
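As a rough sketch of what replacing the asserts might look like, the helper below repairs a missing or duplicated trailing newline and raises a clear error when the header is absent. The name normalize_fasta_record is hypothetical, not part of Requests or of any UniProt client.

```python
def normalize_fasta_record(text):
    """Return text as a FASTA record that ends in exactly one newline.

    Raises ValueError if the text does not begin with a '>' header line.
    """
    # drop surrounding whitespace, including extra trailing newlines
    record = text.strip()
    if not record.startswith('>'):
        raise ValueError(f"not a FASTA record: {record[:40]!r}")
    return record + '\n'


# made-up record without a trailing newline, for illustration only
print(repr(normalize_fasta_record(">example_header\nMKTAYIAKQR")))
```

In the loop above, the two assert lines and the write could then become outfile.write(normalize_fasta_record(response.text)).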
A minor complication is that the response.content you download is not text but bytes. You could decode it if you wanted to, but Requests already does this for you and provides the result in response.text.
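The difference can be seen without any network access. In this small sketch the bytes are a made-up stand-in for response.content, and UTF-8 is assumed as the encoding, which is safe for plain ASCII FASTA data.

```python
import io

# Made-up bytes standing in for response.content (not real UniProt data).
content = b">example_header\nMKTAYIAKQR\n"

# Decoding by hand yields a str, the same type that response.text provides.
text = content.decode("utf-8")
print(type(content).__name__, type(text).__name__)

# A file opened in text mode ('w') accepts str ...
text_out = io.StringIO()
text_out.write(text)

# ... while a file opened in binary mode ('wb') accepts bytes.
bin_out = io.BytesIO()
bin_out.write(content)
```

This is why the original script, which wrote response.content, opened its files with "wb", while the version above writes response.text to a file opened with "w".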