How to extract adjacent strings from txt file with multiple conditions using awk?
Question
I have this text file:
[23/10/10 14:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 15:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjhab" id=432>
<type>New</type>
</Request>
[23/10/10 16:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>
I need to use awk to get all requests and responses that have id=12321 AND type "Old". I've never used awk before, and I can't find a way to get the lines adjacent to the line that contains the id.
The only way I managed to get multiple lines was with grep, but only with one pattern:
$ grep id=12321 file.txt -B2 -A2
[23/10/10 14:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
--
[23/10/10 16:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>
But with grep I can't get only the requests and responses that have BOTH id=12321 AND type "Old".
Maybe I'm taking the wrong approach? Any help would be much appreciated.
Answer 1
Score: 3
Using gnu-awk, you can set the RS variable to </Request> or </Response> as the record separator and then check for both search terms in $0:
awk -v RS='</Re(quest|sponse)>' '/id=12321/ && /<type>Old/ {print $0 RT}' file
[23/10/10 14:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 16:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>
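For readers without GNU awk (RT is a gawk extension), the same idea can be sketched in portable awk by buffering lines until a closing tag is seen. This is only a sketch under stated assumptions: the function name extract_blocks and the variables want_id and want_type are made up for illustration, and it assumes each closing tag sits at the start of its own line, as in the sample.

```shell
# Hypothetical helper: print every block in FILE whose opening tag carries
# id=ID and whose <type> element equals TYPE.
# Usage: extract_blocks ID TYPE FILE
extract_blocks() {
    awk -v want_id="$1" -v want_type="$2" '
        # Collect every line of the current block, header line included.
        { buf = buf $0 ORS }

        # A closing tag ends the block: print it if both strings occur in it.
        /^<\/(Request|Response)>/ {
            if (index(buf, " id=" want_id ">") &&
                index(buf, "<type>" want_type "</type>"))
                printf "%s", buf
            buf = ""
        }
    ' "$3"
}
```

It would be called as: extract_blocks 12321 Old file. Using index() instead of a regex match means the id and type are compared as plain strings, and including the trailing > keeps id=12321 from also matching a longer id such as id=123210.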
Answer 2
Score: 2
Like this, with a proper XML parser, xidel:
$ xidel -s --input-format=text file.txt -e '
for $x in tokenize($raw,"\[.+\] DEBUG")[.]
return parse-xml($x)[./*[@id=12321 and type="Old"]]
' --output-node-format=xml --output-node-indent
Credits to Reino.
Output:
<Request session="lkjh" id="12321">
<type>Old</type>
</Request>
<Response session="lkjh" id="12321">
<type>Old</type>
</Response>
Answer 3
Score: 1
A common solution is to set the record separator RS to something which uniquely identifies a new record, so that the current record in each iteration consists of all the lines you want to examine (one entry or related sequence; your target "thing"). Apart from the timestamp lines, your test data doesn't contain any literal square brackets, so here is a simple demonstration which works for your sample data:
$ awk 'BEGIN { RS="[" } NR>1 && /id=12321/ && /<type>Old<\/type>/ { print "[" $0 }' <<\:
> [23/10/10 14:37:44:527 EST] DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjh" id=12321>
> <type>Old</type>
> </Request>
> [23/10/10 15:37:44:527 EST] DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Request session="lkjhab" id=432>
> <type>New</type>
> </Request>
> [23/10/10 16:37:44:527 EST] DEBUG
> <?xml version="1.1" encoding="UTF-8" ?>
> <Response session="lkjh" id=12321>
> <type>Old</type>
> </Response>
> :
[23/10/10 14:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Request session="lkjh" id=12321>
<type>Old</type>
</Request>
[23/10/10 16:37:44:527 EST] DEBUG
<?xml version="1.1" encoding="UTF-8" ?>
<Response session="lkjh" id=12321>
<type>Old</type>
</Response>
If you need to accommodate literal square brackets in the data as well, you could instead use a regex which matches the entire separator line (the one with the square brackets and DEBUG) as the record separator; but then the contents of that line are consumed as the separator and are not included in the output. (You'll notice that my code above adds back the [ which was "eaten" as a separator.)
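If GNU awk is available, the whole-line separator does not have to be lost. The following is a minimal sketch, not part of the original answer: it assumes gawk (RT is a gawk extension), assumes every header line has the form "[...] DEBUG", and uses a made-up function name, filter_keep_header. Since RT holds the text that terminated the current record, it is saved and printed in front of the next record.

```shell
# Hypothetical helper; requires GNU awk because it relies on RT.
# Usage: filter_keep_header FILE
filter_keep_header() {
    gawk -v RS='\\[[^]\n]*\\] DEBUG\n' '
        # hdr is the separator line that preceded the current record.
        /id=12321/ && /<type>Old<\/type>/ { printf "%s%s", hdr, $0 }

        # RT is the separator that ended this record, which is the
        # header line of the next record, so remember it.
        { hdr = RT }
    ' "$1"
}
```

Square brackets inside the XML are now harmless unless a data line itself ends in something shaped like "[...] DEBUG", because a regex in RS cannot be anchored to the start of a line.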
Answer 4
Score: 1
With your shown samples, in any version of awk, please try the following code. Written and tested with the shown samples only.
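awk '
# A timestamp line starts a new block: print the previous one if it matched.
/^\[[0-9]{2}\/[0-9]{2}\/[0-9]{2}/{
  if(flag2){
    print value
  }
  flag1=flag2=value=""
}
# Append every line to the buffer for the current block.
{
  value=(value?value ORS:"") $0
}
# flag1: the block contains the wanted id.
/ id=12321>/{
  flag1=1
  next
}
# flag2: the wanted type appears after the wanted id.
/<type>Old<\/type>/ && flag1{
  flag2=1
}
# Print the last block at end of input if it matched.
END{
  if(flag2){
    print value
  }
}
' Input_file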