Perl: read and filter input from a file

Question

I have an input data file in the following format:

    <name> <attr1> <attr2> <attr3> <working_area> <date>
    alan x x x /path/to/alan_work/a Wed_May_17_04:17:40_2023
    alan x x x /path/to/alan_work/b Sun_May_28_21:22:52_2023
    alan x a x /path/to/alan_work/c Sun_May_28_22:25:47_2023
    ben x x x /path/to/ben_work/a Wed_May_17_04:18:44_2023
    ben a b x /path/to/ben_work/b Wed_May_17_08:19:47_2023
    charles a a a /path/to/charles_work/a Wed_May_17_04:17:40_2023
    charles a a a /path/to/charles_work/b Thurs_May_18_04:17:40_2023
    ben x x x /path/to/ben_work/c Fri_May_19_04:18:44_2023

I am writing a Perl script and want to meet the following criterion:

  1. For the same user, if attributes 1, 2 and 3 are all the same across two or more different working areas, get the working area path with the latest date attribute.

Expected output:

    /path/to/alan_work/b
    /path/to/alan_work/c
    /path/to/ben_work/c
    /path/to/ben_work/b
    /path/to/charles_work/b

Short snippet (I have no idea how to proceed):

    open(FF, '<', $temp_file) or die "cannot open $temp_file";
    while (my $line = <FF>) {
        chomp $line;
        my @split_type = split(' ', $line);
        # no idea here
    }

Answer 1

Score: 1

Since the values are separated by spaces and the date components by underscores, processing this is fairly straightforward.

We'll use the username and attributes as the key to a hash, and replace the hash value with the workpath that has the highest date value.

To make this work, we have to transform the date into a standard form that can be compared:

    use strict;
    use warnings;
    use v5.10;

    my $file = 'input.txt';
    open my $fh, '<', $file or die "Could not open $file: $!\n";

    my %paths;
    while(<$fh>){
        /^</ and next;     # skip the header
        my ($name, $attr1, $attr2, $attr3, $workpath, $date) = split;
        # separators keep multi-character attributes from colliding
        my $key = join '|', $name, $attr1, $attr2, $attr3;
        $date = transformDate($date);

        $paths{$key} = [$date, $workpath]
            if !defined $paths{$key} || $date gt $paths{$key}[0];
    }

    say $paths{$_}[1] for sort keys %paths;

    # change date from: Wed_May_17_04:17:40_2023
    #          to this: 2023051704:17:40
    sub transformDate {
        my $date = shift;
        state $monthindex = {
            Jan => 1,  Feb => 2,  Mar => 3,
            Apr => 4,  May => 5,  Jun => 6,
            Jul => 7,  Aug => 8,  Sep => 9,
            Oct => 10, Nov => 11, Dec => 12,
        };
        my (undef, $month, $day, $time, $year) = split/_/, $date;
        sprintf('%d%02d%02d%s', $year, $monthindex->{$month}, $day, $time);
    }

Edit: removed the alternative date parsing, since it was not needed after clarifying the date format.
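
The script above prints paths in sorted key order, whereas the expected output in the question happens to follow the order in which each name/attribute group first appears. The following is a minimal sketch of the same hash technique that remembers first-seen order, assuming that order matters to the asker (the question does not say so). The names `sortable_date` and `latest_paths` are hypothetical helpers introduced only for this illustration.

```perl
use strict;
use warnings;
use v5.10;

# Convert e.g. "Wed_May_17_04:17:40_2023" into "2023051704:17:40",
# a fixed-width string whose string order is chronological order.
sub sortable_date {
    my ($date) = @_;
    state $month_index = {
        Jan => 1,  Feb => 2,  Mar => 3,  Apr => 4,
        May => 5,  Jun => 6,  Jul => 7,  Aug => 8,
        Sep => 9,  Oct => 10, Nov => 11, Dec => 12,
    };
    my (undef, $month, $day, $time, $year) = split /_/, $date;
    return sprintf '%04d%02d%02d%s', $year, $month_index->{$month}, $day, $time;
}

# Return the latest path for each (name, attr1, attr2, attr3) group,
# in the order in which each group first appears in the input.
sub latest_paths {
    my @lines = @_;
    my (%latest, @order);
    for my $line (@lines) {
        next if $line =~ /^</ || $line !~ /\S/;    # skip header and blank lines
        my ($name, $attr1, $attr2, $attr3, $path, $date) = split ' ', $line;
        my $key   = join '|', $name, $attr1, $attr2, $attr3;
        my $stamp = sortable_date($date);
        push @order, $key unless exists $latest{$key};
        $latest{$key} = [$stamp, $path]
            if !exists $latest{$key} || $stamp gt $latest{$key}[0];
    }
    return map { $latest{$_}[1] } @order;
}

# Sample data copied from the question.
my @sample = split /\n/, <<'END';
<name> <attr1> <attr2> <attr3> <working_area> <date>
alan x x x /path/to/alan_work/a Wed_May_17_04:17:40_2023
alan x x x /path/to/alan_work/b Sun_May_28_21:22:52_2023
alan x a x /path/to/alan_work/c Sun_May_28_22:25:47_2023
ben x x x /path/to/ben_work/a Wed_May_17_04:18:44_2023
ben a b x /path/to/ben_work/b Wed_May_17_08:19:47_2023
charles a a a /path/to/charles_work/a Wed_May_17_04:17:40_2023
charles a a a /path/to/charles_work/b Thurs_May_18_04:17:40_2023
ben x x x /path/to/ben_work/c Fri_May_19_04:18:44_2023
END

say for latest_paths(@sample);
```

On the sample data this prints the five paths in the same order as the expected output in the question. If the order is irrelevant, the plain `sort keys` loop is simpler.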

Answer 2

Score: 0

Store the data in a hash keyed by name, attributes, and area, with the dates as values. Sort the areas by those values (you need to implement date comparison for your format, or parse the dates while populating the hash so that it holds comparable values) and return the last one.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    # This needs to properly parse the dates, but for the example it
    # works, as the dates to compare are always in the same month and never
    # on the same day.
    sub by_date {
        my ($dates_by_area, $A, $B) = @_;
        $dates_by_area->{$A} =~ /May_?([0-9]+)/;
        my $day_a = $1;
        $dates_by_area->{$B} =~ /May_?([0-9]+)/;
        my $day_b = $1;
        $day_a <=> $day_b
    }

    my $temp_file = shift;

    open my $in, '<', $temp_file or die "cannot open $temp_file";
    my %dates;
    while (my $line = <$in>) {
        next if $line =~ /^</;

        my ($name, $attr1, $attr2, $attr3, $area, $date) = split ' ', $line;
        $dates{$name}{$attr1}{$attr2}{$attr3}{$area} = $date;
    }

    for my $name (keys %dates) {
        for my $attr1 (keys %{ $dates{$name} }) {
            for my $attr2 (keys %{ $dates{$name}{$attr1} }) {
                for my $attr3 (keys %{ $dates{$name}{$attr1}{$attr2} }) {
                    my %dates_by_area = %{ $dates{$name}{$attr1}{$attr2}{$attr3} };
                    my @sorted = sort { by_date(\%dates_by_area, $a, $b) }
                                 keys %dates_by_area;
                    say $sorted[-1];
                }
            }
        }
    }
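
The comment on `by_date` says the dates still need to be parsed properly. One possible way to do that is sketched below using the core module Time::Piece. It assumes every date has the shape weekday_Mon_DD_HH:MM:SS_YYYY, as in the question's sample. `date_to_epoch` and `by_full_date` are hypothetical names, not part of the answer.

```perl
use strict;
use warnings;
use feature qw{ say };
use Time::Piece;

# Parse e.g. "Wed_May_17_04:17:40_2023" into epoch seconds.
# The weekday is dropped first: it is redundant, and the sample
# spells it inconsistently ("Thurs" rather than "Thu").
sub date_to_epoch {
    my ($date) = @_;
    (my $rest = $date) =~ s/^[^_]+_//;    # strip the leading weekday
    $rest =~ tr/_/ /;                     # "May 17 04:17:40 2023"
    return Time::Piece->strptime($rest, '%b %d %H:%M:%S %Y')->epoch;
}

# Comparator with the same calling convention as by_date.
sub by_full_date {
    my ($dates_by_area, $A, $B) = @_;
    return date_to_epoch($dates_by_area->{$A})
       <=> date_to_epoch($dates_by_area->{$B});
}

# Two of ben's areas from the question's sample.
my %dates_by_area = (
    '/path/to/ben_work/a' => 'Wed_May_17_04:18:44_2023',
    '/path/to/ben_work/c' => 'Fri_May_19_04:18:44_2023',
);
my @sorted = sort { by_full_date(\%dates_by_area, $a, $b) }
             keys %dates_by_area;
say $sorted[-1];    # the area with the latest date
```

Because the comparison uses full timestamps, it also orders dates that fall in different months or on the same day. It parses each date on every comparison, so for large inputs it may be preferable to parse once while populating the hash.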

The structure collected in %dates can be inspected using

    use Data::Dumper;
    warn Dumper \%dates;

which gives the following output for the sample:

    $VAR1 = {
              'alan' => {
                          'x' => {
                                   'x' => {
                                            'x' => {
                                                     '/path/to/alan_work/a' => 'Wed_May17_04:17:40_2023',
                                                     '/path/to/alan_work/b' => 'Sun_May_28_21:22:52_2023'
                                                   }
                                          },
                                   'a' => {
                                            'x' => {
                                                     '/path/to/alan_work/c' => 'Sun_May_28_22:25:47_2023'
                                                   }
                                          }
                                 }
                        },
              'ben' => {
                         'x' => {
                                  'x' => {
                                           'x' => {
                                                    '/path/to/ben_work/a' => 'Wed_May17_04:18:44_2023',
                                                    '/path/to/ben_work/c' => 'Fri_May19_04:18:44_2023'
                                                  }
                                         }
                                },
                         'a' => {
                                  'b' => {
                                           'x' => {
                                                    '/path/to/ben_work/b' => 'Wed_May17_08:19:47_2023'
                                                  }
                                         }
                                }
                       },
              'charles' => {
                             'a' => {
                                      'a' => {
                                               'a' => {
                                                        '/path/to/charles_work/a' => 'Wed_May17_04:17:40_2023',
                                                        '/path/to/charles_work/b' => 'Thurs_May18_04:17:40_2023'
                                                      }
                                             }
                                    }
                           }
            };

You gave no instructions on what should happen if there are two different dates for the same name, attributes, and area. The current implementation just uses the last corresponding line from the input.

Also, note that I switched to lexical filehandles to avoid the problems that bareword filehandles bring.
When using split ' ', you don't need to chomp, as this special form of split removes the trailing whitespace, including a newline.
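
A small sketch of that `split ' '` behaviour, using one line of the question's sample with extra leading spaces added for illustration. The helper name `fields` is hypothetical.

```perl
use strict;
use warnings;
use feature qw{ say };

# split ' ' is the special "awk mode": it skips leading whitespace and
# splits on runs of whitespace, so the trailing newline never ends up
# in the last field.
sub fields {
    my ($line) = @_;
    return split ' ', $line;
}

my $line = "  alan x x x /path/to/alan_work/a Wed_May_17_04:17:40_2023\n";

my @fields = fields($line);
my @naive  = split / /, $line;    # ordinary pattern: one split per single space

say scalar @fields;     # 6
say scalar @naive;      # 8: two empty leading fields, newline kept in the last
say "[$fields[-1]]";    # [Wed_May_17_04:17:40_2023]
```

With the ordinary pattern `/ /`, the line would need both a chomp and extra handling for leading or repeated spaces.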

Answer 3

Score: 0

    perl -MTime::Piece -nE '
        # extract fields
        ($u,$a1,$a2,$a3,$p,$_) = split;
        $id = "$u $a1 $a2 $a3";

        # massage date format into standard form
        y/[A-Za-z0-9]//cd;
        s/.*([A-Z][a-z]{2})[^\d]*/$1/;
        eval {
            $t = Time::Piece->strptime($_,"%b%d%H%M%S%Y")->datetime;
        } or do {
            # add error handling
            # (this also catches any header)
            next;
        };

        # save path if "better"
        if ($t ge $ts{$id}) {
            $ts{$id} = $t;
            $ps{$id} = $p;
        }

        # print results
        END { say for sort values %ps }
    ' datafile
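
The date-massaging step in the one-liner is terse, so here it is pulled out into a standalone sketch that shows the intermediate string for one sample date. `normalise_date` is a hypothetical name, and the sketch assumes the month is always the last capitalised three-letter run in the field, which holds for the question's sample.

```perl
use strict;
use warnings;
use feature qw{ say };
use Time::Piece;

# Reduce e.g. "Thurs_May_18_04:17:40_2023" to "May180417402023".
sub normalise_date {
    my ($date) = @_;
    $date =~ tr/A-Za-z0-9//cd;                 # drop "_", ":" and any newline
    # The greedy .* backtracks to the LAST capital letter followed by two
    # lowercase letters (the month), so the weekday in front is discarded.
    $date =~ s/.*([A-Z][a-z]{2})[^\d]*/$1/;
    return $date;
}

my $compact = normalise_date("Thurs_May_18_04:17:40_2023\n");
say $compact;    # May180417402023

# strptime dies on unparsable input, which is why the one-liner wraps
# the call in eval and skips the line on failure.
my $stamp = eval { Time::Piece->strptime($compact, '%b%d%H%M%S%Y')->datetime };
say defined $stamp ? $stamp : 'unparsable';
```

The `datetime` method returns an ISO 8601 string, which is why the one-liner can compare timestamps with the string operator `ge`.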

huangapple
  • Posted on May 29, 2023 at 21:44:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76357903.html