英文:
How to retrieve data until a keyword in GenBank with Perl?
问题
I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.
This is my input:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
http://www.ensemblgenomes.org/ for more information.
The current script:
$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;
$end= $#data;
for ($line= 0; $line<= $end; $line++){
if ($data[$line] =~ /LOCUS/){
@annotation = (@annotation, $data[$line]);
until ($data[$line] =~ /COMMENT/){
$line++;
@annotation = (@annotation, $data[$line]);
}}}
print @annotation;
And its OUTPUT:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
As you can see, there's an issue with this method.
How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?
The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (so it can be done without a regex match to the word). I'm clueless on how it can be done though. I will really appreciate your input!
英文:
I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.
This is my input:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
http://www.ensemblgenomes.org/ for more information.
The current script:
$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;
$end= $#data;
for ($line= 0; $line<= $end; $line++){
if ($data[$line] =~ /LOCUS/){
@annotation = (@annotation, $data[$line]);
until ($data[$line] =~ /COMMENT/){
$line++;
@annotation = (@annotation, $data[$line]);
}}}
print @annotation;
And its OUTPUT:
LOCUS mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
sequence 1..19524 reannotated via EnsEMBL
ACCESSION primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION mitochondrion_genomeBDGP6.32
KEYWORDS .
SOURCE fruit fly
ORGANISM Drosophila melanogaster
Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
Drosophilidae; Drosophilinae.
COMMENT This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
As you can see, there's an issue with this method.
How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?
The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (so it can be done without a regex match to the word). I'm clueless on how it can be done though. I will really appreciate your input!!
答案1
得分: 1
那看起来比必要的要复杂得多。我会选择类似以下的东西:
use strict;
use warnings;
my $print = 0;
open(IN, '<', 'genbank.txt');
open(OUT, ">genbank-out-without-comment.txt") or die "opsala $!";
while(<IN>){
if(/^LOCUS\s/){
$print = 1;
}
if(/^COMMENT\s/i){
print "\n"; # 保留条目之间的换行符
$print = 0;
}
print OUT if $print;
}
close(OUT);
英文:
That looks a lot more complicated than it needs to be. I'd go with something like:
use strict;
use warnings;
my $print = 0;
open(IN, '<', 'genbank.txt');
open(OUT, ">genbank-out-without-comment.txt") or die "opsala $!";
while(<IN>){
if(/^LOCUS\s/){
$print = 1;
}
if(/^COMMENT\s/i){
print "\n"; # preserve new line between entries
$print = 0;
}
print OUT if $print;
}
close(OUT);
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论