英文:
How to read a list of gene pairs and write a fasta file for each line
问题
I'm new to bioinformatics and would really appreciate some help!
I have a big multi-fasta file (genes.faa), like this:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)
And a list of gene pairs (gene.pairs.txt), with two genes per line separated by a tab:
gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B
And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:
fasta1
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta2
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta3
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line.
Thank you so much in advance for any help!
英文:
I'm new to bioinformatics and would really aprecciated some help!
I have a big multi-fasta file (genes.faa), like this:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)
And a list of gene pairs (gene.pairs.txt), with two genes per line separeted by a tab:
gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B
And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:
fasta1
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta2
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta3
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line.
Thank you so much in advance for any help!
答案1
得分: 0
我尝试,肯定有更快更好的方法来完成它,让我想知道是否有一种方法可以跳过创建大字典的步骤:sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}
。
我正在使用Biopython库来解析fasta文件并将它们写入文件https://biopython.org/,https://github.com/biopython/biopython;无论如何:
输入 'big_fasta_2.fa'
:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAA
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAY
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAW
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAP
输入 "gene_pairs_3.txt"
:
gene13_A gene33_B
gene1344_A gene33_B
gene2_A gene48_B
gene23333_A gene48_B
gene56_A gene2_B
代码:
from Bio import SeqIO, __version__
print('Biopython version : ', __version__)
sequences = {i.id: i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}
print(sequences)
file = open("gene_pairs_3.txt", "r")
cnt = 1
for line in file:
a, b = line.split()
print('++++++++++++')
print('pairs N°: ', cnt)
print(a)
print(b)
if a in sequences:
print('ok A')
print(sequences[a])
if b in sequences:
print('ok B')
print(sequences[b])
SeqIO.write([sequences[a], sequences[b]], 'Fasta' + str(cnt) + '.fa', 'fasta')
print('\nwritten file: ', 'Fasta' + str(cnt) + '.fa')
cnt += 1
else:
print('No B')
else:
print('No A')
continue
print('-----------\n')
查看输出文件,看看它们是否符合你的期望。
英文:
My attempt, sure there are faster better ways to accomplish it, it makes me wonder if there could be a way to skip the creation of the big dictionary : sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}
.
I am using the Biopython library to parse the fasta file and to write them https://biopython.org/ , https://github.com/biopython/biopython; anyway:
input 'big_fasta_2.fa'
:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAA
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAY
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAW
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAP
input "gene_pairs_3.txt"
:
gene13_A gene33_B
gene1344_A gene33_B
gene2_A gene48_B
gene23333_A gene48_B
gene56_A gene2_B
code :
from Bio import SeqIO, __version__
print('Biopython version : ', __version__)
sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}
print(sequences)
file = open("gene_pairs_3.txt","r")
cnt = 1
for line in file:
a, b = line.split()
print('++++++++++++')
print('pairs N° : ', cnt)
print(a)
print(b)
if a in sequences:
print('ok A')
print(sequences[a])
if b in sequences:
print('ok B')
print(sequences[b])
SeqIO.write([sequences[a],sequences[b]] , 'Fasta'+str(cnt)+'.fa' , 'fasta')
print('\nwritten file : ' ,'Fasta'+str(cnt)+'.fa' )
cnt += 1
else:
print('No B')
else:
print('No A')
continue
print('-----------\n')
Have a look at the output files and see if they are what you expected.
答案2
得分: 0
这段代码有效。我已经在以下文件上进行了测试:
输入的FASTA文件:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4
基因列表文件:
gene1_A gene3_B
gene2_A gene4_B
代码:
from Bio import SeqIO
my_gene_file = open('genelist.txt','r')
my_fasta_file = open('input.fasta','r')
my_list=[]
for line in my_gene_file:
line = line.strip(' \n')
gene_id1, gene_id2 = line.split('\t')
gene_tuple = (gene_id1, gene_id2)
my_list.append(gene_tuple)
my_dict={}
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):
my_dict[seq_record.id] = seq_record.seq
for item in my_list:
if item[0] in my_dict and item[1] in my_dict:
output_file_name=(f'{item[0]}_{item[1]}.fasta')
f = open(output_file_name, "w")
f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]}')
它会在文件名中输出具有FASTA记录ID的文件。
英文:
This code works. I have tested on the following files:
Input FASTA file:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4
Gene List file:
gene1_A gene3_B
gene2_A gene4_B
Code:
from Bio import SeqIO
my_gene_file = open('genelist.txt','r')
my_fasta_file = open('input.fasta','r')
my_list=[]
for line in my_gene_file:
line = line.strip(' \n')
gene_id1, gene_id2 = line.split('\t')
gene_tuple = (gene_id1, gene_id2)
my_list.append(gene_tuple)
my_dict={}
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):
my_dict[seq_record.id] = seq_record.seq
for item in my_list:
if item[0] in my_dict and item[1] in my_dict:
output_file_name=(f'{item[0]}_{item[1]}.fasta')
f = open(output_file_name, "w")
f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]]}')
It outputs files with fasta record ids in the filenames.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论