How do I use regex in grep to match multiple lines and only get the last matched set?

Question

I have a file with some statistics like this

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

and I would like to extract the last set of statistics (in the example above, the last 6 lines). As you can see, the number of lines in each section can change, but the first and last lines stay constant. I was thinking about using:

  • "TOTAL" as an anchor point to grab the first and the last line of the wanted block of text
  • (?s) mode to match all lines in between those two

I ended up with the regex (?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?$. To use it on Linux, I ran the following command, which relies on grep's -P regex extension (I haven't had much luck with the -E extension):

tac con.log | grep -Po "(?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?$" -m1 | tac

which resulted in this correct output

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

as expected. However, this was in my testing environment, which uses an old grep (version 2.5.3). When I tried it on my other machine, running Rocky Linux 9 with grep 3.6, I got no match at all. Since this regex also worked when tested at regex101.com, I believe this might be a nuance of newer grep versions. Is there anything special that newer versions of grep require for a regex like this to work, or is there another way to get this result (ultimately, it will be used in a bash script)?

Answer 1

Score: 3

With Perl,† one way:

perl -0777 -wnE'$r = $1 while /(^[0-9\s:-]+TOTAL.+? TOTAL.+?$)/smxg; say $r' file

or

perl -0777 -wnE'say for /.*( ^[0-9\s:-]+ TOTAL.+? TOTAL.+?$ )/smxg' file

Both versions have to go over the whole file: the first captures and assigns every such record until it reaches the last one, and the second matches the whole file up to the last one. The approach from the question, though, makes three passes over the file. We can process backwards if performance is an issue, like here for example. See the performance effect here.

Altogether I'd recommend a short script instead.

Not sure why grep does what you show; I'd imagine that the above regex should work, even slightly simplified using grep's conventions.


† In the question as originally posted by the OP there was a perl tag.

Answer 2

Score: 2

With GNU grep:

grep -zPo '(?s).*\n\K.*TOTAL .*?TOTAL:.*?\n' con.log

This works with 3.7. Seems to mostly work with version 2.20 (appends an extraneous newline). It is likely to be inefficient with huge input files.

I suspect the reason your regex that works at regex101 is failing when used with grep is that grep applies the regex to each line of input in turn. So a regex that tries to match multiple lines at once is always going to fail.
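
The line-by-line behaviour can be checked with a small experiment. This is only a sketch: it assumes GNU grep built with PCRE support (-P), and the three-line sample and the temporary file are made up for illustration. -c counts matching records, and -z makes grep use NUL rather than newline as the record separator, so a file without NUL bytes is treated as a single record.

```shell
#!/bin/sh
# Hypothetical three-line sample; only the first and last lines contain TOTAL.
sample=$(mktemp)
printf '%s\n' 'x TOTAL MEMORY' 'x class.a | 1' 'x TOTAL: 1%' > "$sample"

# Default mode: every line is its own record, so no single record holds
# two TOTALs. grep exits non-zero on zero matches, hence "|| true".
per_line=$(grep -Pc '(?s)TOTAL.*TOTAL' "$sample" || true)

# -z mode: the whole file is one record, so the pattern can span lines.
whole_file=$(grep -zPc '(?s)TOTAL.*TOTAL' "$sample" || true)

echo "per-line records matched:   $per_line"
echo "whole-file records matched: $whole_file"
rm -f "$sample"
```

On a grep that behaves as described above, the first count should be 0 and the second 1.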


With tac and awk, to avoid reading the entire file:

tac con.log | awk 's+=($3~/^TOTAL:?$/); s>1{exit}' | tac

s starts as zero/false. Each time a start or finish line is found, it is incremented. When non-zero, the line is printed (default action). When both start and finish lines have matched (s==2), we abort.

Assumes only well-formed records in the log. Allows for unrelated data interspersed between statistics records.
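
Since the result is meant for a bash script, the counter logic described above could be wrapped in a function. The following is a sketch under assumptions: the function name last_block is hypothetical, tac is assumed to be available (GNU coreutils), and the sample log is a shortened copy of the data from the question.

```shell
#!/bin/sh
# last_block FILE: print the final statistics block of FILE.
# (Hypothetical helper name; same awk logic as the one-liner above.)
last_block() {
    tac "$1" | awk '
        # s counts boundary lines seen so far (field 3 is TOTAL or TOTAL:).
        # Used as a pattern, the updated s also decides whether to print.
        s += ($3 ~ /^TOTAL:?$/)
        # The second boundary is the block header: stop reading.
        s > 1 { exit }
    ' | tac
}

# Shortened sample modeled on the log in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
EOF

result=$(last_block "$log")
printf '%s\n' "$result"
rm -f "$log"
```

Because awk exits as soon as it sees the header of the last block, only the tail of the file is read, however large the log is.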

If the file could end with a partial record (that should be ignored), there is:

tac con.log | awk '
    !s && $3=="TOTAL:" { s=1 }
    s;
    s && $3=="TOTAL" { exit }
' | tac

If the log file contains no unrelated data (just a list of complete statistics records), then only the termination condition needs to be tested:

tac con.log | awk '1; $3=="TOTAL"{exit}' | tac

Assuming that number of lines of output will never exceed some threshold (here 1000), there is also a straightforward tac and (GNU) grep solution that works whether or not there is a partial final record:

tac con.log |
grep -A1000 -m1 'TOTAL:' |
grep -B1000 -m1 'TOTAL ' |
tac

Answer 3

Score: 2

or just do it the ultra lazy way:

echo '
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed' |
mawk 'BEGIN { RS = ORS = " consumed\n" } END { print }'

or even

gawk 'BEGIN { RS=(ORS=FS=" consumed\n")"$" } $0=$NF'

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

Answer 4

Score: 1

The key observation with your data is that you want everything from the last occurrence of TOTAL MEMORY ALLOCATION CONSUMPTION in your input dataset onwards. You can use greedy matching to achieve that:

use strict;
use warnings;


my $data = <<EOM;
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
EOM

$data =~ s/.+           # Do a greedy match
           (?=          # positive lookahead, consumes nothing
              ^         #     Start of a line
              .+?       #     non-greedy match
              TOTAL\sMEMORY\sALLOCATION\sCONSUMPTION # literal string
            )           # end of lookahead
            //smx; # allow . to match newline & ^ to match start of line

print $data;

running that gives

$ perl try.pl 
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

That can all be condensed into a one-liner

cat your_data | perl -0777 -pe 's/.+(?=^.+?TOTAL\sMEMORY ALLOCATION CONSUMPTION)//sm'

Answer 5

Score: 1

For completeness, a simple Awk script.

awk '/TOTAL MEMORY/ { p=$0; next }
p { p = p ORS $0 }
/TOTAL:/ { result=p; p="" }
END { print result }' file

This implements a simple state machine where we collect all the lines in the current entry into a string and then at the end print out the (last) collected string.

In some more detail, recall that Awk runs the script on each incoming line (or, more broadly, input record), one at a time. When we see the first regex, we start collecting items into p, and skip the rest of the script for this line. On subsequent lines, as long as p is nonempty, we add lines to it, separated by ORS, the output record separator (which defaults to newline). When we reach an input line which matches TOTAL:, we stop collecting and copy the currently collected p into result. Finally, the END block runs after we reach the end of the input stream, and we print whatever string we last collected into result.

In addition to being portable way back to the original AT&T Unix, this is also easy to understand and modify; the regular expressions are trivial, and the overall logic is reasonably simple and obvious.
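
For use inside a script, the same state machine can be wrapped in a function that also reports whether any complete block was found. This is a sketch, not part of the original answer: the function name last_complete_block and the non-zero exit status are assumptions added for illustration, and the sample log is invented to show a truncated final record.

```shell
#!/bin/sh
# last_complete_block FILE: print the last complete block of FILE.
# Returns 1 if no complete block exists. (Hypothetical helper name.)
last_complete_block() {
    awk '
        /TOTAL MEMORY/ { p = $0; next }        # header: start a new block
        p              { p = p ORS $0 }        # inside a block: append line
        /TOTAL:/       { result = p; p = "" }  # footer: block is complete
        END {
            if (result == "") exit 1           # nothing complete was seen
            print result
        }
    ' "$1"
}

# Invented sample: one complete block followed by a truncated one.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 30% out of 100% allocated memory consumed
2023-01-04 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-04 01:00:00 COMPONENT | USAGE (%)
EOF

result=$(last_complete_block "$log")
printf '%s\n' "$result"
rm -f "$log"
```

The truncated record at the end never reaches a TOTAL: line, so it is never copied into result and the earlier complete block is printed instead.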

Answer 6

Score: 0

Using any awk in any shell on every Unix box:

$ awk '/TOTAL /{rec=$0; next} {rec=rec ORS $0} END{print rec}' file
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

Answer 7

Score: 0

Grep with tail will do the job:

$ grep TOTAL: file -B6 | tail -n6
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
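
The fixed count of 6 only fits blocks of exactly that length, while the question notes that the number of lines varies. One possible variant, sketched here under assumptions, looks up the line number of the last header with grep -n and prints from there with tail. The function name last_block_by_lineno is hypothetical, the sample is a shortened copy of the question's data, and the log is assumed to end with a complete record.

```shell
#!/bin/sh
# last_block_by_lineno FILE: print FILE from its last header line onwards.
# (Hypothetical helper; reads the file twice.)
last_block_by_lineno() {
    # grep -n prefixes each match with "N:", so field 1 is the line number.
    start=$(grep -n 'TOTAL MEMORY' "$1" | tail -n 1 | cut -d: -f1)
    [ -n "$start" ] || return 1      # no header line found
    tail -n "+$start" "$1"           # print from line $start to the end
}

# Shortened sample modeled on the log in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
EOF

result=$(last_block_by_lineno "$log")
printf '%s\n' "$result"
rm -f "$log"
```

Unlike the tac-based approaches, this scans the whole file with grep before printing, so it trades some efficiency for using only grep, tail and cut.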

huangapple
  • Posted on May 25, 2023, 23:22:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76333924.html