读取基因对列表并为每一行写入一个fasta文件。

huangapple go评论62阅读模式
英文:

How to read a list of gene pairs and write a fasta file for each line

问题

I'm new to bioinformatics and would really appreciate some help!

I have a big multi-fasta file (genes.faa), like this:

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)

And a list of gene pairs (gene.pairs.txt), with two genes per line separated by a tab:

gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B

And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:

fasta1

>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta2

>gene2_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta3

>gene56_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line.

Thank you so much in advance for any help!

英文:

I'm new to bioinformatics and would really aprecciated some help!

I have a big multi-fasta file (genes.faa), like this:

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)

And a list of gene pairs (gene.pairs.txt), with two genes per line separeted by a tab:

gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B

And I needed a way to read the list of gene pairs and create a fasta file for each line of the list of gene pairs. So, in this case, I would have 3 fasta files (the name of the output fasta files is not important), like this:

fasta1

>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta2

>gene2_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

fasta3

>gene56_A 
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC

I tried to write a script in python but I couldn't find a way to read the list in a loop and write a fasta file for each line.
Thank you so much in advance for any help!

答案1

得分: 0

我尝试,肯定有更快更好的方法来完成它,让我想知道是否有一种方法可以跳过创建大字典的步骤:sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}

我正在使用Biopython库来解析fasta文件并将它们写入文件https://biopython.org/https://github.com/biopython/biopython;无论如何:

输入 'big_fasta_2.fa' :

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAA
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAY
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAW
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAP

输入 "gene_pairs_3.txt":

gene13_A	gene33_B
gene1344_A	gene33_B
gene2_A	gene48_B
gene23333_A	gene48_B
gene56_A	gene2_B

代码:

from Bio import SeqIO,  __version__

print('Biopython version : ', __version__)

sequences = {i.id: i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}

print(sequences)

file = open("gene_pairs_3.txt", "r")

cnt = 1
for line in file:
    a, b = line.split()
    print('++++++++++++')
    print('pairs N°: ', cnt)
    print(a)
    print(b)

    if a in sequences:
        print('ok A')
        print(sequences[a])

        if b in sequences:
            print('ok B')
            print(sequences[b])

            SeqIO.write([sequences[a], sequences[b]], 'Fasta' + str(cnt) + '.fa', 'fasta')

            print('\nwritten file: ', 'Fasta' + str(cnt) + '.fa')

            cnt += 1
        else:
            print('No B')
    else:
        print('No A')
        continue

    print('-----------\n')

查看输出文件,看看它们是否符合你的期望。

英文:

My attempt, sure there are faster better ways to accomplish it, it makes me wonder if there could be a way to skip the creation of the big dictionary : sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}.

I am using the Biopython library to parse the fasta file and to write them https://biopython.org/ , https://github.com/biopython/biopython; anyway:

input 'big_fasta_2.fa' :

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAA
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAY
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAW
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAP

input "gene_pairs_3.txt":

gene13_A	gene33_B
gene1344_A	gene33_B
gene2_A	gene48_B
gene23333_A	gene48_B
gene56_A	gene2_B

code :

from Bio import SeqIO,  __version__

print('Biopython version : ', __version__)


sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}

print(sequences)


file = open("gene_pairs_3.txt","r")


cnt = 1
for line in file:
    
    
    a, b  =  line.split()
    print('++++++++++++')
    print('pairs N° : ', cnt)
    print(a)
    print(b)
    
    if a in sequences:
        print('ok A')
        print(sequences[a])
        
        if b in sequences:
            print('ok B')
            print(sequences[b])
            
            SeqIO.write([sequences[a],sequences[b]] , 'Fasta'+str(cnt)+'.fa' , 'fasta')
            
            print('\nwritten file : ' ,'Fasta'+str(cnt)+'.fa' )
            
            cnt += 1
        else:
            
            print('No B')
    
    else:
        print('No A')
        continue
    
    print('-----------\n')

Have a look at the output files and see if they are what you expected.

答案2

得分: 0

这段代码有效。我已经在以下文件上进行了测试:

输入的FASTA文件:

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4

基因列表文件:

gene1_A gene3_B
gene2_A gene4_B

代码:

from Bio import SeqIO

my_gene_file   = open('genelist.txt','r')
my_fasta_file  = open('input.fasta','r')

my_list=[]
for line in my_gene_file:
    line = line.strip(' \n')
    gene_id1, gene_id2 = line.split('\t')   
    gene_tuple = (gene_id1, gene_id2)
    my_list.append(gene_tuple)

my_dict={}
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):   
    my_dict[seq_record.id] = seq_record.seq

for item in my_list:
    if item[0] in my_dict and item[1] in my_dict:
        output_file_name=(f'{item[0]}_{item[1]}.fasta')
        f = open(output_file_name, "w")
        f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]}')

它会在文件名中输出具有FASTA记录ID的文件。

英文:

This code works. I have tested on the following files:

Input FASTA file:

>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4

Gene List file:

gene1_A	gene3_B
gene2_A	gene4_B

Code:

from Bio import SeqIO

my_gene_file   = open('genelist.txt','r')
my_fasta_file  = open('input.fasta','r')

my_list=[]
for line in my_gene_file:
    line = line.strip(' \n')
    gene_id1, gene_id2 = line.split('\t')   
    gene_tuple = (gene_id1, gene_id2)
    my_list.append(gene_tuple)

my_dict={}
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):   
    my_dict[seq_record.id] = seq_record.seq

for item in my_list:
    if item[0] in my_dict and item[1] in my_dict:
        output_file_name=(f'{item[0]}_{item[1]}.fasta')
        f = open(output_file_name, "w")
        f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]]}')

It outputs files with fasta record ids in the filenames.

huangapple
  • 本文由 发表于 2023年2月8日 23:17:03
  • 转载请务必保留本文链接:https://go.coder-hub.com/75387876.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定