How to extract adjacent strings from txt file with multiple conditions using awk?

Question

I have this txt file

    [23/10/10 14:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Request session="lkjh" id=12321>
    <type>Old</type>
    </Request>
    [23/10/10 15:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Request session="lkjhab" id=432>
    <type>New</type>
    </Request>
    [23/10/10 16:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Response session="lkjh" id=12321>
    <type>Old</type>
    </Response>

I need to use awk to get all requests and responses that have id=12321 AND type "Old". I've never used awk before, and I can't find a way to get the lines adjacent to the line that contains the id.

The only way I managed to get multiple lines was with grep, but that only works with a single pattern.

    $ grep id=12321 file.txt -B2 -A2
    [23/10/10 14:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Request session="lkjh" id=12321>
    <type>Old</type>
    </Request>
    --
    [23/10/10 16:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Response session="lkjh" id=12321>
    <type>Old</type>
    </Response>

But with grep I can't get only the requests and responses that have BOTH id=12321 AND type "Old".

Maybe I'm taking the wrong approach? Any help would be much appreciated.

Answer 1

Score: 3

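With GNU awk (gawk) you can set the record separator RS to a regex that matches either </Request> or </Response>, and then check each record ($0) for both search terms. RT, a gawk-specific variable that holds the text matched by RS, puts the closing tag back on output: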

    awk -v RS='</Re(quest|sponse)>' '/id=12321/ && /<type>Old/ {print $0 RT}' file

    [23/10/10 14:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Request session="lkjh" id=12321>
    <type>Old</type>
    </Request>

    [23/10/10 16:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Response session="lkjh" id=12321>
    <type>Old</type>
    </Response>

Answer 2

Score: 2

Like this, with a proper XML parser such as xidel:

    $ xidel -s --input-format=text file.txt -e '
      for $x in tokenize($raw,"\[.+\] DEBUG")[.]
      return parse-xml($x)[./*[@id=12321 and type="Old"]]
    ' --output-node-format=xml --output-node-indent

Credits to Reino

Output

    <Request session="lkjh" id="12321">
    <type>Old</type>
    </Request>
    <Response session="lkjh" id="12321">
    <type>Old</type>
    </Response>

Answer 3

Score: 1

A common solution is to set the record separator RS to something which uniquely identifies a new record, so that the current record in each iteration consists of all the lines you want to examine (one entry or related sequence; your target "thing"). Your test data didn't contain any literal square brackets so this is a simple demonstration which works for your sample data:

    $ awk 'BEGIN { RS="[" } NR>1 && /id=12321/ && /<type>Old<\/type>/ { print "[" $0 }' <<\:
    > [23/10/10 14:37:44:527 EST] DEBUG
    > <?xml version="1.1" encoding="UTF-8" ?>
    > <Request session="lkjh" id=12321>
    > <type>Old</type>
    > </Request>
    > [23/10/10 15:37:44:527 EST] DEBUG
    > <?xml version="1.1" encoding="UTF-8" ?>
    > <Request session="lkjhab" id=432>
    > <type>New</type>
    > </Request>
    > [23/10/10 16:37:44:527 EST] DEBUG
    > <?xml version="1.1" encoding="UTF-8" ?>
    > <Response session="lkjh" id=12321>
    > <type>Old</type>
    > </Response>
    > :
    [23/10/10 14:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Request session="lkjh" id=12321>
    <type>Old</type>
    </Request>
    [23/10/10 16:37:44:527 EST] DEBUG
    <?xml version="1.1" encoding="UTF-8" ?>
    <Response session="lkjh" id=12321>
    <type>Old</type>
    </Response>

If you need to accommodate literal square brackets in the data as well, you might perhaps sacrifice the separator line (the one with the square brackets and the DEBUG) and use a regex which uses the entire line as the separator; but that then means that the contents of that line will be discarded as a separator, and not included in the output. (You'll notice that my code above adds back the [ which was "eaten" as a separator.)
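The regex-separator variant described above might look roughly like the sketch below. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves this unspecified), and it assumes no XML line ends in text of the form `[...] DEBUG`. The function name `extract_by_header_rs` is made up for illustration.

```shell
# Sketch: use the whole "[timestamp] DEBUG" line as the record separator.
# Assumption: this awk treats a multi-character RS as a regex (gawk, mawk).
extract_by_header_rs() {
    # -v processes escapes, so the regex awk sees is:  \[[^<newline>]*\] DEBUG<newline>
    awk -v RS='\\[[^\n]*\\] DEBUG\n' '
        # Each record is now just the XML lines of one entry; the header
        # line was consumed as the separator and is not printed.
        /id=12321>/ && /<type>Old<\/type>/ { printf "%s", $0 }
    ' "$1"
}
```

Calling `extract_by_header_rs file.txt` on the sample data should print the two matching XML fragments without their timestamp lines. With gawk specifically, printing `RT` as well would presumably bring the separator text back, but each record's `RT` is the header that follows it, not the one that precedes it.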

Answer 4

Score: 1

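With your shown samples, please try the following code. It should work in any POSIX-compliant awk; the `{2}` interval expressions are not supported by some older implementations (for example mawk 1.3.3, or gawk before 4.0 without --re-interval). Written and tested with the shown samples only.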

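    awk '
    # A timestamp line starts a new block: print the previous block
    # if it matched both conditions, then reset the state.
    /^\[[0-9]{2}\/[0-9]{2}\/[0-9]{2}/{
      if(flag2){
        print value
      }
      flag1=flag2=value=""
    }
    # Append every line to the buffer for the current block.
    {
      value=(value?value ORS:"") $0
    }
    # First condition: the id line.
    / id=12321>/{
      flag1=1
      next
    }
    # Second condition: the type line, only counted after the id was seen.
    /<type>Old<\/type>/ && flag1{
      flag2=1
    }
    # The last block has no following timestamp line, so check it here.
    END{
      if(flag2){
        print value
      }
    }
    ' Input_file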
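
The buffering approach above hard-codes the id and the type. A minimal sketch of the same idea with both values passed in as awk variables could look like this. The function name `extract_blocks` and the variables `want_id` and `want_type` are illustrative, not part of the original answer. Unlike the original, it treats any line starting with `[` as the start of a block, matches the two conditions in either order, and uses `index()` for literal string matching, so it needs no interval expressions.

```shell
# Sketch: print every block whose lines contain both the wanted id and type.
# Usage: extract_blocks FILE ID TYPE   (names are illustrative assumptions)
extract_blocks() {
    awk -v want_id="$2" -v want_type="$3" '
        # Print the buffered block if both conditions were seen, then reset.
        function emit_block() {
            if (has_id && has_type) printf "%s", block
            block = ""; has_id = 0; has_type = 0
        }
        /^\[/ { emit_block() }            # assumed: "[" at line start begins a new block
        { block = block $0 "\n" }         # buffer every line of the current block
        index($0, " id=" want_id ">")           { has_id = 1 }
        index($0, "<type>" want_type "</type>") { has_type = 1 }
        END { emit_block() }              # the last block has no following header
    ' "$1"
}
```

For the sample file, `extract_blocks file.txt 12321 Old` should print the first and third blocks including their timestamp lines, while `extract_blocks file.txt 432 Old` should print nothing.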

huangapple
  • Posted on March 3, 2023, 19:54:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75626750.html