How to match two consecutive newlines using Perl

Question

I have a problem that I seem unable to solve, with a Perl program I coded to parse a particular results file.

Its aim is to capture two values from a table embedded in a `.txt` results file, along with many other lines of information.

Since the table doesn't have a fixed number of lines in every file, the only way I found to parse it is to detect the two consecutive newlines after the table and then go "backward" to capture the values of interest in the last line.

Here is the regex I'm using unsuccessfully and an example of the results table.

    $_=~/([\d+\,]+)\s+([\d+\,]+\.\d+)\r\n\r\n/

Table (values of interest in bold+italic):

 | begin | 1,699,932 | 10,136.45 |
 |:---- |:------:| -----:|
 | 1 | 1,712,388 | 12,455.32 | 
 | 2 | 1,712,605 | 12,484.85 | 
 | 3 | ***1,712,611*** | ***12,513.51*** | 

I tried several regex tools online where my regex matches correctly, but once incorporated into my code, it just doesn't work...

Example of reproducible code:

    #!/usr/bin/perl -w
    use strict;
    use Getopt::Long;

    my ($path);
    GetOptions(
        'path=s' => \$path,    # GetOptions needs a reference to store the value
    );

    chdir $path or die "ERROR: Unable to enter $path: $!\n";
    opendir (TEMP , ".");
    my @files = readdir (TEMP);
    closedir TEMP;

    for my $file (@files) {
        my $mAssize;
        my $qualAssize;

        if($file=~/(\w+)\_LRassembly.unicycler.log/){

            my $sample=$1;

            open(INFILE,"$file") or die ("ERROR: Unable to open Log to parse file $!\n");
            chomp(my @data = <INFILE>);

            for (@data) {

                # This is the pattern that never matches
                if($_=~/([\d+\,]+)\s+([\d+\,]+\.\d+)\r\n\r\n/){

                    print"Matched\n";
                    $mAssize=$1;
                    $qualAssize=$2;
                    print "MaxAssemblySize $mAssize\n";
                    print "QualAssembly $qualAssize\n";
                }
            }

            # OUT is not opened anywhere in this excerpt
            print OUT ("$sample\t$mAssize\t$qualAssize\n") or die ("ERROR: Unable to write log parsing file $!\n");

        }
        close INFILE;
    }

Find attached an example of a full results file containing the table:
[https://file.io/H8kCqE3gRov0][1]


  [1]: https://file.io/H8kCqE3gRov0

Sample of the output file:

    Polishing miniasm assembly with Racon (2023-07-08 00:32:20)
    -----------------------------------------------------------
    Unicycler now uses Racon to polish the miniasm assembly. It does multiple rounds of polishing to get the best consensus. Circular unitigs are rotated between rounds such that all parts (including the ends) are polished well.

    Saving to /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/racon_polish/polishing_reads.fastq:
      38,855 long reads

    Polish       Assembly          Mapping
    round            size          quality
    begin       1,671,271        29,207.18
    1           1,685,412        33,629.12
    2           1,685,573        33,654.73
    3           1,685,628        33,682.91

    Best polish: /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/racon_polish/016_rotated.fasta
    Saving /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/13_racon_polished.gfa
    Saving /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/003_racon_polished.gfa

The expected output would be to print the following line (taking the sample results above as a guide):

| sample  | Assembly | Mapping |
|:---- |:------:| -----:|
| NKC1231 | 1,685,628 | 33,682.91| 

Answer 1

Score: 2

Each element of `@data` is a single line, so none of them could possibly contain two LFs!
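To see this in isolation, here is a minimal sketch that reads a short string line by line through an in-memory filehandle, the same way `<INFILE>` reads the log. The sample text and the variable names are made up for illustration. The blank-line pattern finds nothing in any single element, yet it matches when applied to the whole text at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the end of the table and the blank line after it.
my $text = "2  1,685,573  33,654.73\r\n"
         . "3  1,685,628  33,682.91\r\n"
         . "\r\n"
         . "next section\r\n";

# Read it line by line through an in-memory filehandle, as <INFILE> would.
open( my $fh, "<", \$text ) or die "Can't open in-memory file: $!\n";
chomp( my @lines = <$fh> );
close $fh;

# Every element ends at the first "\n", so none can contain "\r\n\r\n".
my $per_line_hits = grep { /\r\n\r\n/ } @lines;

# The same pattern does match when it sees the whole text at once.
my $whole_text_hit = $text =~ /\r\n\r\n/ ? 1 : 0;

print "lines read: ", scalar(@lines), "\n";
print "per-line matches: $per_line_hits\n";
print "whole-text match: $whole_text_hit\n";
```

Note that `chomp` also removes the trailing `\n` from every element, so even a single `\r\n` can no longer match inside the loop.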

If you're going to read the file a line at a time, you will need to carry information from one pass to the next.

    open( my $INFILE, "<", $file )           # Avoid globals, and always use 3-arg open
       or die( "ERROR: Can't open log file `$file`: $!\n" );  # Incl file name

    my ( $found, $mAssize, $qualAssize );
    while ( <$INFILE> ) {                    # No need to load entire file into mem.
       s/\s+\z//;  # Remove line endings. Handles both `\n` and `\r\n`.

       if ( $found && !length( $_ ) ) {
          print "Matched\n";
          print "MaxAssemblySize $mAssize\n";
          print "QualAssembly $qualAssize\n";
       }

       $found = ( $mAssize, $qualAssize ) = /([\d+\,]+)\s+([\d+\,]+\.\d+)\z/;
    }
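If the log is small enough to hold in memory, another option is to read the whole file into one string so that a single pattern can span the line boundary. The sketch below is not part of the original answer. The helper names `last_table_row` and `slurp` are hypothetical, the sample text is a shortened excerpt of the log from the question, and `\R` and `\h` assume Perl 5.10 or later. It returns the first table row that is followed by an empty line, so a log with several such tables would need a more specific pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the two values from the first table row that is followed by an
# empty line, or an empty list if there is no such row.
sub last_table_row {
    my ($text) = @_;
    # [\d,]+      digits with thousands separators (assembly size)
    # \h+         horizontal whitespace only, so the match stays on one line
    # \h*\R\h*\R  end of that line, then an empty line (LF or CRLF)
    return $text =~ /([\d,]+)\h+([\d,]+\.\d+)\h*\R\h*\R/;
}

# Read a whole file into one string so a pattern can span line boundaries.
sub slurp {
    my ($file) = @_;
    open( my $fh, "<", $file ) or die "ERROR: Can't open `$file`: $!\n";
    local $/;    # undefined record separator: read everything at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Shortened excerpt of the sample log, used instead of a real file.
my $sample_log = join "\n",
    "Polish       Assembly          Mapping",
    "round            size          quality",
    "begin       1,671,271        29,207.18",
    "1           1,685,412        33,629.12",
    "2           1,685,573        33,654.73",
    "3           1,685,628        33,682.91",
    "",
    "Best polish: 016_rotated.fasta",
    "";

my ( $mAssize, $qualAssize ) = last_table_row($sample_log);
print "NKC1231\t$mAssize\t$qualAssize\n" if defined $mAssize;
```

For a real log the call would be `last_table_row( slurp($file) )`. Compared with the line-by-line loop, this trades memory for a simpler pattern, and no state has to be carried between iterations.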

huangapple
  • Posted on July 17, 2023, 17:59:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76703344.html