如何使用Perl在GenBank中检索数据直到关键字?

huangapple go评论58阅读模式
英文:

How to retrieve data until a keyword in GenBank with Perl?

问题

I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.

This is my input:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

The current script:

$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;

$end= $#data;
for ($line= 0; $line<= $end; $line++){
    if ($data[$line] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

And its OUTPUT:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

As you can see, there's an issue with this method.

How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?

The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (so it can be done without a regex match to the word). I'm clueless on how it can be done though. I will really appreciate your input!

英文:

I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.

This is my input:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

The current script:

$genbank = &lt;STDIN&gt;;
chomp ($genbank);
open (READ, &quot;&lt;$genbank&quot;) or die;
@data = &lt;READ&gt;;
close READ;

$end= $#data;
for ($line= 0; $line&lt;= $end; $line++){
    if ($data[$line] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

And its OUTPUT:


LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

As you can see, there's an issue with this method.

How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?

The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (so it can be done without a regex match to the word). I'm clueless on how it can be done though. I will really appreciate your input!!

答案1

得分: 1

那看起来比必要的要复杂得多。我会选择类似以下的东西:

use strict;
use warnings;

my $print = 0;
open(IN, '<', 'genbank.txt');
open(OUT, ">genbank-out-without-comment.txt") or die "opsala $!";
while(<IN>){
  if(/^LOCUS\s/){
    $print = 1;
  }
  if(/^COMMENT\s/i){
    print "\n"; # 保留条目之间的换行符
    $print = 0;
  }
  print OUT if $print;
}
close(OUT);
英文:

That looks a lot more complicated than it needs to be. I'd go with something like:

use strict;
use warnings;

my $print = 0;
open(IN, &#39;&lt;&#39;, &#39;genbank.txt&#39;);
open(OUT, &quot;&gt;genbank-out-without-comment.txt&quot;) or die &quot;opsala $!&quot;;
while(&lt;IN&gt;){
  if(/^LOCUS\s/){
    $print = 1;
  }
  if(/^COMMENT\s/i){
    print &quot;\n&quot;; # preserve new line between entries
    $print = 0;
  }
  print OUT if $print;
}
close(OUT);

huangapple
  • 本文由 发表于 2023年6月5日 19:10:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76405846.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定