How to extract adjacent strings from a txt file with multiple conditions using awk?

Question

I have this txt file

[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 15:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjhab" id=432>
<type>New</type>
</Request>
[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

I need to use awk to get all requests and responses that have id=12321 AND type "Old". I've never used awk before, and I can't find a way to get the lines adjacent to the line that contains the id.

The only way I managed to get multiple lines was with grep, but only with one pattern.

$ grep id=12321 file.txt -B2 -A2
[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
--
[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

But with grep I can't get requests and responses that have BOTH id=12321 AND type "Old".

Maybe I'm taking the wrong approach? Any help would be much appreciated.

Answer 1

Score: 3

Using GNU awk, you can set the RS variable to a regex that matches </Request> or </Response> as the record separator, and then check for both search terms in $0. RT holds the text that matched RS, so printing $0 RT puts the closing tag back:

awk -v RS='</Re(quest|sponse)>' '/id=12321/ && /<type>Old/ {print $0 RT}' file

[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>

[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

Answer 2

Score: 2

Like this, with a proper XML parser such as xidel:

$ xidel -s --input-format=text file.txt -e '
    for $x in tokenize($raw,"\[.+\]  DEBUG")[.]
    return parse-xml($x)[./*[@id=12321 and type="Old"]]
' --output-node-format=xml --output-node-indent

Credit to Reino.

Output:

<Request session="lkjh" id="12321">
  <type>Old</type>
</Request>
<Response session="lkjh" id="12321">
  <type>Old</type>
</Response>

Answer 3

Score: 1

A common solution is to set the record separator RS to something that uniquely identifies the start of a new record, so that the current record in each iteration consists of all the lines you want to examine (one entry or related sequence; your target "thing"). Your test data doesn't contain any literal opening square brackets apart from the one at the start of each timestamp line, so here is a simple demonstration that works for your sample data:

$ awk 'BEGIN { RS="[" } NR>1 && /id=12321/ && /<type>Old<\/type>/ { print "[" $0 }' <<\:
> [23/10/10 14:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjh" id=12321>
> <type>Old</type>
> </Request>
> [23/10/10 15:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjhab" id=432>
> <type>New</type>
> </Request>
> [23/10/10 16:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Response session="lkjh" id=12321>
> <type>Old</type>
> </Response>
> :
[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>

[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

If you also need to accommodate literal square brackets in the data, you could sacrifice the separator line (the one with the square brackets and DEBUG) and use a regex that matches the entire line as the separator. That means the contents of that line are consumed as the separator and are not included in the output. (You'll notice that my code above adds back the [ which was "eaten" as a separator.)
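
The whole-line separator idea in the last paragraph can be sketched as follows. This is an illustration rather than the answerer's code: the function name print_bodies is hypothetical, and a multi-character regex RS is an extension beyond POSIX that gawk and mawk, among others, support. The regex is not anchored to the start of a line, so the sketch assumes that a "[...]  DEBUG" sequence never appears inside the XML itself.

```shell
# print_bodies FILE: print the XML body of every record that has
# id=12321 and <type>Old</type>, without its "[timestamp]  DEBUG" line.
# Hypothetical helper; needs an awk that accepts a regex RS (gawk, mawk).
print_bodies() {
  # RS regex: a literal "[", anything up to the next "]", one or more
  # spaces, "DEBUG", and the newline that ends the separator line.
  awk -v RS='\\[[^]]*] +DEBUG\n' '
    / id=12321>/ && /<type>Old<\/type>/ { printf "%s", $0 }
  ' "$1"
}
```

Called as print_bodies file.txt on the sample data, this should print the two matching XML bodies with their timestamp lines dropped. In gawk the dropped line could be recovered from RT, but RT describes the separator that follows a record, not the one that precedes it, so it would have to be saved from the previous iteration.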

Answer 4

Score: 1

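With your shown samples, please try the following code, which should work in any POSIX awk. (The first regex uses interval expressions such as {2}, which some older awk implementations do not support.) Written and tested with the shown samples only.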

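awk '
# A timestamp line such as "[23/10/10 ..." starts a new record.
/^\[[0-9]{2}\/[0-9]{2}\/[0-9]{2}/{
  if(flag2){          # the previous record matched both conditions
    print value
  }
  flag1=flag2=value=""
}
{
  value=(value?value ORS:"") $0   # append the current line to the buffer
}
/ id=12321>/{
  flag1=1             # the id matched
  next
}
/<type>Old<\/type>/ && flag1{
  flag2=1             # the type matched after the id matched
}
END{
  if(flag2){          # flush the last record
    print value
  }
}
' Input_file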
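
The buffering approach above can also be parameterized. The following is a minimal sketch, not part of the original answer: the function name filter_log and the awk names emit, rec, has_id, and has_type are hypothetical, and it assumes that every record starts with a line beginning with [. It matches the id and type as literal strings with index(), so the two conditions may appear in either order, and it avoids interval expressions.

```shell
# filter_log ID TYPE FILE: print every record that contains both
# " id=ID>" and "<type>TYPE</type>". Hypothetical helper, POSIX awk only.
filter_log() {
  awk -v id="$1" -v type="$2" '
    # Print the buffered record if both conditions were seen, then reset.
    function emit() {
      if (has_id && has_type) printf "%s", rec
      rec = ""
      has_id = has_type = 0
    }
    /^\[/ { emit() }               # a line starting with "[" opens a new record
    { rec = rec $0 "\n" }          # buffer every line of the current record
    index($0, " id=" id ">")           { has_id = 1 }
    index($0, "<type>" type "</type>") { has_type = 1 }
    END { emit() }                 # flush the last record
  ' "$3"
}
```

For example, filter_log 12321 Old file.txt should print the first and third records of the sample file, each including its timestamp line.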

huangapple
  • Published on March 3, 2023, 19:54:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75626750.html