Question

I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.

This is my input:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

The current script:

$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;

$end= $#data;
for ($line= 0; $line<= $end; $line++){
    if ($data[$line] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

And its OUTPUT:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

As you can see, there's an issue with this method.

How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?

The first line of every GenBank file starts with LOCUS, and I suppose this could be used to write better code (anchoring on the start of the line rather than matching the word anywhere). I have no idea how to do that, though. I would really appreciate your input!

Answer 1

Score: 1

That looks a lot more complicated than it needs to be. I'd go with something like:

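use strict;
use warnings;

my $print = 0;   # true while we are between a LOCUS line and the next COMMENT line
open(IN, '<', 'genbank.txt') or die "cannot open genbank.txt: $!";
open(OUT, '>', 'genbank-out-without-comment.txt') or die "opsala $!";
while(<IN>){
  if(/^LOCUS\s/){
    $print = 1;   # start of an entry: begin copying lines
  }
  if(/^COMMENT\s/i){
    print OUT "\n"; # preserve new line between entries
    $print = 0;     # stop copying before the COMMENT line itself is written
  }
  print OUT if $print;
}
close(IN);
close(OUT);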
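
If you would rather collect the lines in an array, as the script in the question does, the same flag idea can be sketched as below. This is only a minimal sketch: `header_lines` is a hypothetical helper name, and the sample records are made up and shortened from the input in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the lines of
# each record from its LOCUS line up to, but not including, its COMMENT line.
sub header_lines {
    my @lines = @_;
    my @annotation;
    my $in_header = 0;
    for my $line (@lines) {
        $in_header = 1 if $line =~ /^LOCUS\s/;     # a new record starts here
        $in_header = 0 if $line =~ /^COMMENT\s/;   # test BEFORE storing the line
        push @annotation, $line if $in_header;
    }
    return @annotation;
}

# Made-up two-record sample, shortened from the input in the question.
my @sample = (
    "LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022\n",
    "KEYWORDS    .\n",
    "COMMENT     This sequence was annotated by FlyBase.\n",
    "            Ensembl or EnsemblGenomes web site\n",
    "LOCUS       second_record 10 bp DNA\n",
    "COMMENT     another comment\n",
);
print header_lines(@sample);
```

The loop in the question keeps the COMMENT line because it increments `$line` and stores `$data[$line]` before the `until` condition is checked again. Here the COMMENT test runs before the line is stored, so the line is never added.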
