How to download multiple sequences in one fasta file from UniProt using Python 3
Question
I made a Python script for downloading protein sequences from UniProt in FASTA format. The script reads accession numbers from a text file (one per line) and tries to download the corresponding sequence from the UniProt database. Here is the script:
import requests

with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()

count = 0
for line in lines:
    count += 1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2
    response = requests.get(URL)
    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)
print("Total sequences downloaded = ", count)
This works fine, but for hundreds of sequences it generates a large number of files. It would be better to write everything to a single file, with each incoming sequence written below the previous one: the first, then the second after it, and so on. A FASTA file is basically a text file in which each sequence is preceded by a header line marked with ">".
For example:
>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh
and so on.
Answer 1
Score: 0
Something like this? Just write them all to the same file.
import requests

# The backslash continues the with statement onto the next line
with open('testfasta.txt', 'r') as infile, \
        open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that fetch succeeded; raise error if not
        response.raise_for_status()
        assert response.text.startswith('>')
        assert response.text.endswith('\n')
        outfile.write(response.text)
print(f"Total sequences downloaded = {count}")
This assumes that the data you fetch is newline-terminated and includes the FASTA header before the sequence itself. If that's not always true, replace the asserts with code that fixes any such problems. I also made various changes to make the script more idiomatic: enumerate for counting, an f-string for the URL, and raise_for_status to catch failed downloads.
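If you want to replace the asserts, something along these lines might do. This is only a sketch: normalize_fasta is a name I made up for illustration, not part of Requests or any UniProt client, and it assumes each response holds a single record whose first non-blank line is the header.

```python
def normalize_fasta(text: str) -> str:
    """Return text as a newline-terminated FASTA record.

    Hypothetical helper (not from any library). Raises ValueError
    if the text does not start with a '>' header line.
    """
    # Drop surrounding blank lines and whitespace.
    record = text.strip()
    if not record.startswith('>'):
        raise ValueError('response does not look like FASTA (no ">" header)')
    # Guarantee exactly one trailing newline so records concatenate cleanly.
    return record + '\n'
```

In the loop you would then call outfile.write(normalize_fasta(response.text)) instead of the two asserts, and decide for yourself whether a ValueError should stop the run or just skip that accession.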
A minor complication is that the response.content you download is not text but bytes, which is why your original script had to open its files in "wb" mode. You could decode it yourself if you wanted to, but Requests already does this for you and provides the result in response.text.
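To make the bytes/text difference concrete, here is a small offline sketch. The record is made up (it is not a real UniProt entry), UTF-8 is assumed as the encoding, and the io objects merely stand in for files opened in "wb" and "w" mode.

```python
import io

# Made-up bytes standing in for response.content.
raw = b">example_header\nMKTAYIAKQR\n"

# Roughly what response.text gives you, assuming a UTF-8 response.
text = raw.decode("utf-8")

binary_out = io.BytesIO()   # stands in for open(path, "wb")
binary_out.write(raw)

text_out = io.StringIO()    # stands in for open(path, "w")
text_out.write(text)

# Mixing the two fails: a text-mode stream rejects bytes.
mixing_error = None
try:
    text_out.write(raw)
except TypeError as err:
    mixing_error = err
```

So either pair works (content with "wb", or text with "w"); you just can't mix them.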