How to match two consecutive new lines using Perl
Question
I have a problem I can't seem to solve with a Perl program I wrote to parse a particular results file.
Its aim is to capture two values from a table that is embedded, among many other lines of information, in a `.txt` results file.
Since the table doesn't have a fixed number of lines in every file, the only way I found to parse it is to detect the two consecutive newlines after the table and then go "backward" to capture the values of interest in its last row.
Here is the regex I'm using unsuccessfully and an example of the results table.
$_=~/([\d+\,]+)\s+([\d+\,]+\.\d+)\r\n\r\n/)
Table (values of interest in bold+italic):
| begin | 1,699,932 | 10,136.45 |
|:---- |:------:| -----:|
| 1 | 1,712,388 | 12,455.32 |
| 2 | 1,712,605 | 12,484.85 |
| 3 | ***1,712,611*** | ***12,513.51*** |
I tried several online regex tools, where my regex matches correctly, but once it is incorporated into my code it just doesn't work.
Example of reproducible code:
#!usr/bin/perl -w
use strict;
use Getopt::Long;
my ($path);
GetOptions(
    'path=s' => $path,
);
chdir $path or die "ERROR: Unable to enter $path: $!\n";
opendir (TEMP , ".");
my @files = readdir (TEMP);
closedir TEMP;
for my $file (@files) {
    my $mAssize;
    my $qualAssize;
    if($file=~/(\w+)\_LRassembly.unicycler.log/){
        my$sample=$1;
        open(INFILE,"$file") or die ("ERROR: Unable to open Log to parse file $!\n");
        chomp(my @data = <INFILE>);
        for (@data) {
            if($_=~/([\d+\,]+)\s+([\d+\,]+\.\d+)\r\n\r\n/){
                print"Matched\n";
                $mAssize=$1;
                $qualAssize=$2;
                print "MaxAssemblySize $mAssize\n";
                print "QualAssembly $qualAssize\n";
            }
        }
        print OUT ("$sample\t$mAssize\t$qualAssize\n") or die ("ERROR: Unable to write log parsing file $!\n");
    }
    close INFILE;
}
Find attached an example of a full results file containing the table:
[https://file.io/H8kCqE3gRov0][1]
[1]: https://file.io/H8kCqE3gRov0
Sample of the output file:
Polishing miniasm assembly with Racon (2023-07-08 00:32:20)
-----------------------------------------------------------
Unicycler now uses Racon to polish the miniasm assembly. It does multiple rounds of polishing to get the best consensus. Circular unitigs are rotated between rounds such that all parts (including the ends) are polished well.
Saving to /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/racon_polish/polishing_reads.fastq:
38,855 long reads
Polish Assembly Mapping
round size quality
begin 1,671,271 29,207.18
1 1,685,412 33,629.12
2 1,685,573 33,654.73
3 1,685,628 33,682.91
Best polish: /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/racon_polish/016_rotated.fasta
Saving /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/miniasm_assembly/13_racon_polished.gfa
Saving /storage/ONT/NETRAM_Campy/Filtered_reads/NKC1231_LRassembly/003_racon_polished.gfa
The expected output would be to print the following line (taking the sample results above as a guide):
| sample | Assembly | Mapping |
|:---- |:------:| -----:|
| NKC1231 | 1,685,628 | 33,682.91|
Answer 1
Score: 2
Each element of `@data` is a single line, so none of them can possibly contain two line feeds!
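To see this concretely, here is a minimal sketch that applies the question's pattern to a whole string and then to the same text split into lines. The sample string and the variable names are made up for illustration, and the pattern is kept as in the question apart from writing the character class as `[\d,]+`.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the log file: a table row, a blank line, more text.
my $text = "3   1,685,628   33,682.91\r\n\r\nBest polish: (path omitted)\r\n";

# Read it the way the question does: one array element per line, then chomp.
open( my $fh, "<", \$text ) or die "Can't open in-memory file: $!\n";
chomp( my @data = <$fh> );
close $fh;

my $pattern = qr/([\d,]+)\s+([\d,]+\.\d+)\r\n\r\n/;

# The pattern can match the whole text, which still contains both line endings...
my $whole_matches = $text =~ $pattern ? 1 : 0;

# ...but no single chomped line contains even one "\n", let alone two.
my $line_matches = grep { $_ =~ $pattern } @data;

print "whole text: $whole_matches, per line: $line_matches\n";
```

This is also why the online regex testers appear to work: they are given the whole text at once rather than one line at a time.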
If you're going to read the file a line at a time, you will need to carry information from one pass to the next.
open( my $INFILE, "<", $file )   # Avoid globals, and always use 3-arg open
    or die( "ERROR: Can't open log file `$file`: $!\n" );   # Incl file name
my ( $found, $mAssize, $qualAssize );
while ( <$INFILE> ) {   # No need to load entire file into mem.
    s/\s+\z//;   # Remove line endings. Handles both `\n` and `\r\n`.
    if ( $found && !length( $_ ) ) {
        print "Matched\n";
        print "MaxAssemblySize $mAssize\n";
        print "QualAssembly $qualAssize\n";
    }
    $found = ( $mAssize, $qualAssize ) = /([\d+\,]+)\s+([\d+\,]+\.\d+)\z/;
}
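Building on the loop above, here is a self-contained sketch that keeps the captured values so they can still be written out after the loop has finished. It is a variation rather than the exact code above, under a few assumptions: `parse_polish_table` is a hypothetical helper name, the sample log is abbreviated from the excerpt in the question (with a blank line assumed after the table and the `Best polish` path omitted), and the character class is written `[\d,]+`, since a `+` inside the brackets matches a literal plus sign rather than acting as a quantifier.

```perl
use strict;
use warnings;

# Return the (size, quality) pair from the last table row that is immediately
# followed by an empty line, or an empty list if there is no such row.
# parse_polish_table is a hypothetical name used only for this sketch.
sub parse_polish_table {
    my ($fh) = @_;
    my ( @prev, @result );
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;    # strip "\n" or "\r\n" and trailing blanks
        if ( @prev && !length $line ) {
            @result = @prev;    # empty line right after a table row
        }
        # Captures from this line, or an empty list if it is not a table row.
        @prev = $line =~ /([\d,]+)\s+([\d,]+\.\d+)\z/;
    }
    return @result;
}

# Abbreviated sample based on the question's excerpt (assumed layout).
my $log = join "\r\n",
    "Polish       Assembly          Mapping",
    "round            size          quality",
    "begin       1,671,271        29,207.18",
    "1           1,685,412        33,629.12",
    "2           1,685,573        33,654.73",
    "3           1,685,628        33,682.91",
    "",
    "Best polish: (path omitted)",
    "";

open( my $fh, "<", \$log ) or die "Can't open in-memory log: $!\n";
my ( $size, $quality ) = parse_polish_table($fh);
close $fh;

die "No table row followed by an empty line was found\n" unless defined $size;

my $sample = "NKC1231";    # in the real script this comes from the file name
print join( "\t", $sample, $size, $quality ), "\n";
```

Copying the captures into a separate variable matters because a failed match in list context returns an empty list, which would otherwise wipe the values on the very next line.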