2023-03-03 19:54:06

How to extract adjacent strings from txt file with multiple conditions using awk?

Question

I have this txt file:

[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 15:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjhab" id=432>
<type>New</type>
</Request>
[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

I need to use awk to get all requests and responses that have id=12321 AND type "Old". I've never used awk before, and I can't find a way to get the lines adjacent to the line that contains the id.

The only way I managed to get multiple lines was with grep, but that only works with one pattern:

$ grep id=12321 file.txt -B2 -A2
[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
--
[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

But with grep I can't get the requests and responses that have BOTH id=12321 AND type "Old".

Maybe I'm taking the wrong approach? Any help would be much appreciated.

Answer 1

Score: 3

Using gnu-awk, you can set the RS variable to a regex that matches </Request> or </Response> as the record separator, and then check for both search terms in $0. RT holds the text that matched RS, so printing $0 RT puts the closing tag back:

awk -v RS='</Re(quest|sponse)>' '/id=12321/ && /<type>Old/ {print $0 RT}' file

[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>

[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>
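
A regex RS and the RT variable go beyond what POSIX awk guarantees, so the same idea can be approximated by assembling each record by hand and closing it at the end tag. This is only a minimal sketch, assuming each closing tag starts its own line as in the sample data; the function name filter_by_closing_tag and the variables want_id and want_type are made up for illustration:

```shell
# Hypothetical helper: print every Request/Response block whose opening tag
# carries the wanted id and whose <type> element holds the wanted type.
# Usage: filter_by_closing_tag ID TYPE FILE
filter_by_closing_tag() {
    awk -v want_id="$1" -v want_type="$2" '
        { rec = rec $0 "\n" }                 # collect the current line
        /^<\/(Request|Response)>/ {           # closing tag ends the record
            if (index(rec, "id=" want_id ">") &&
                index(rec, "<type>" want_type "</type>"))
                printf "%s", rec              # both conditions hold: emit it
            rec = ""                          # start the next record
        }
    ' "$3"
}
```

Called as filter_by_closing_tag 12321 Old file.txt, it should print the same two blocks as the gawk one-liner, without the blank line between them. The id test uses a plain substring match on "id=12321>", so it relies on the id being the last attribute of the tag, as it is in the sample.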

Answer 2

Score: 2

Like this, with a proper XML parser, xidel:

$ xidel -s --input-format=text file.txt -e '
    for $x in tokenize($raw,"\[.+\]  DEBUG")[.]
    return parse-xml($x)[./*[@id=12321 and type="Old"]]
' --output-node-format=xml --output-node-indent

Credits to Reino.

Output:

<Request session="lkjh" id="12321">
  <type>Old</type>
</Request>
<Response session="lkjh" id="12321">
  <type>Old</type>
</Response>

Answer 3

Score: 1

A common solution is to set the record separator RS to something which uniquely identifies a new record, so that the current record in each iteration consists of all the lines you want to examine (one entry or related sequence; your target "thing"). Your test data doesn't contain any literal [ other than the one that opens each header line, so this is a simple demonstration which works for your sample data:

$ awk 'BEGIN { RS="[" } NR>1 && /id=12321/ && /<type>Old<\/type>/ { print "[" $0 }' <<\:
> [23/10/10 14:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjh" id=12321>
> <type>Old</type>
> </Request>
> [23/10/10 15:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjhab" id=432>
> <type>New</type>
> </Request>
> [23/10/10 16:37:44:527 EST]  DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Response session="lkjh" id=12321>
> <type>Old</type>
> </Response>
> :
[23/10/10 14:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 16:37:44:527 EST]  DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>

If you need to accommodate literal square brackets in the data as well, you might perhaps sacrifice the separator line (the one with the square brackets and the DEBUG) and use a regex which uses the entire line as the separator; but that then means that the contents of that line will be discarded as a separator, and not included in the output. (You'll notice that my code above adds back the [ which was "eaten" as a separator.)
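
If losing the header line is not acceptable, another option is to leave RS alone and start a new record whenever a whole line looks like a header. The following is a minimal sketch in portable awk, assuming every header has the form [...] followed by spaces and DEBUG on a line by itself; the function name filter_by_header and its parameters are hypothetical:

```shell
# Hypothetical helper: records start at lines that look like "[...]  DEBUG",
# so a stray "[" inside the XML no longer splits a record.
# Usage: filter_by_header ID TYPE FILE
filter_by_header() {
    awk -v want_id="$1" -v want_type="$2" '
        function emit() {                     # print the finished record if it matches
            if (index(rec, "id=" want_id ">") &&
                index(rec, "<type>" want_type "</type>"))
                printf "%s", rec
            rec = ""
        }
        /^\[.*\] +DEBUG$/ { emit() }          # header line: close the previous record
        { rec = rec $0 "\n" }                 # then collect the line, header included
        END { emit() }                        # the last record has no header after it
    ' "$3"
}
```

Because the header line is collected like any other line, it stays in the output and nothing has to be added back by hand. The trade-off is that a data line which happens to match the header pattern would still start a new record.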

Answer 4

Score: 1

With your shown samples, please try the following code; it should work in any POSIX-compliant awk (the {2} interval expressions need one). Written and tested with the shown samples only.

awk '
# A timestamp line such as [23/10/10 ...] starts a new record.
/^\[[0-9]{2}\/[0-9]{2}\/[0-9]{2}/{
  if(flag2){            # previous record matched both conditions
    print value
  }
  flag1=flag2=value=""  # reset state for the new record
}
{
  value=(value?value ORS:"") $0   # append the current line to the record
}
/ id=12321>/{
  flag1=1               # id found
  next
}
/<type>Old<\/type>/ && flag1{
  flag2=1               # type found after the id
}
END{
  if(flag2){            # flush the last record
    print value
  }
}
' Input_file
