Lookahead captures unwanted characters

Question

I'm trying to capture alert names that come from the firewall. Each log has the following format:

datetime alertname severity_level username endpoint_name domain

The current RegEx I'm using works for all logs except for the third one. Any ideas on how to fix it?

import re

regex = []

text = """2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1
2023-05-27 / 23:28:08 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVXAP02$ SRVDC2 ACME 1
2023-05-28 / 02:24:29 User account locked out multiple login errors high SRVDC2$ john.smith.admin SRVDC2 \\\\NECBROWSER 1
2023-05-28 / 05:01:48 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVNPS01$ SRVDC1 ACME 1
2023-05-28 / 06:38:57 Computer account added/changed/deleted. medium ANONYMOUS LOGON VD-OPERATOR1$ SRVDC1 ACME 1"""

pattern = '(?:(?<=\d{2}:\d{2}:\d{2}))(.*)(?=\.)|(?=medium )|(?=high )|(?=low )|(?=critical )'
regex.append(re.findall(pattern,text,re.MULTILINE))
print(regex)

Current Output

[[' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '', ' User account locked out multiple login errors high SRVDC2$ john.smith', ' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '']]

Expected Output

[[' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '', ' User account locked out multiple login errors', ' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '']]

Answer 1

Score: 3

You could use the following pattern (split across three lines for readability, so it needs verbose mode):

\d{2}:\d{2}:\d{2}\s+
(.*?)
\s(?:medium|high|low|critical)

See a demo on regex101.com.

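In contrast to your original attempt, this one matches the timestamp directly instead of through a lookbehind (lookbehinds are "expensive"!), follows it with a lazy quantifier, and wraps the severity keywords in a non-capturing group. The alert name is then in the first capturing group.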
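To see why the lazy quantifier matters, here is a minimal sketch that runs both approaches against the third log line only. The `line` variable is a simplified copy of that log entry (the backslashes before NECBROWSER are dropped), and the greedy pattern is just the first alternative of the original attempt.

```python
import re

# Simplified copy of the third log line (backslashes omitted for clarity).
line = ("2023-05-28 / 02:24:29 User account locked out multiple login errors "
        "high SRVDC2$ john.smith.admin SRVDC2 NECBROWSER 1")

# Greedy: .* runs to the end of the line, then backtracks to the LAST
# position that is followed by a literal dot, inside "john.smith.admin".
greedy = re.search(r'(?<=\d{2}:\d{2}:\d{2})(.*)(?=\.)', line)
print(repr(greedy.group(1)))

# Lazy: .*? grows one character at a time and stops at the FIRST position
# followed by whitespace plus one of the severity keywords.
lazy = re.search(r'\d{2}:\d{2}:\d{2}\s+(.*?)\s(?:medium|high|low|critical)', line)
print(repr(lazy.group(1)))
```

The first two log lines happen to work with the greedy version only because their sole dot sits at the end of the alert name.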

In Python this could be

import re

text = """2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1
2023-05-27 / 23:28:08 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVXAP02$ SRVDC2 ACME 1
2023-05-28 / 02:24:29 User account locked out multiple login errors high SRVDC2$ john.smith.admin SRVDC2 \\\\NECBROWSER 1
2023-05-28 / 05:01:48 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVNPS01$ SRVDC1 ACME 1
2023-05-28 / 06:38:57 Computer account added/changed/deleted. medium ANONYMOUS LOGON VD-OPERATOR1$ SRVDC1 ACME 1"""

pattern = re.compile(r'''
    \d{2}:\d{2}:\d{2}\s+              # time stamp, then whitespace
    (.*?)                             # alert name, as short as possible
    \s(?:medium|high|low|critical)    # whitespace, then the severity level
''', re.VERBOSE)

# Group 1 holds the alert name for each log line.
messages = [match.group(1) for match in pattern.finditer(text)]
print(messages)

And would yield

['Computer account added/changed/deleted.', 'Computer account added/changed/deleted.', 'User account locked out multiple login errors', 'Computer account added/changed/deleted.', 'Computer account added/changed/deleted.']

See a demo on ideone.com.
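If the severity level is needed too, or the trailing period should go, one possible variation is sketched below. The names `LOG_LINE`, `alert`, `severity` and `sample` are illustrative and not part of the answer above; the `^` anchor and the `\b` word boundary are extra assumptions about the log format (every entry starts a line, and a severity keyword is never a prefix of a longer word).

```python
import re

# Hypothetical variant: named groups, anchored at the start of each line.
LOG_LINE = re.compile(r'''
    ^\S+\s/\s\d{2}:\d{2}:\d{2}\s+                   # date, " / ", time
    (?P<alert>.*?)                                  # alert name, lazy
    \s+(?P<severity>low|medium|high|critical)\b     # severity keyword
''', re.VERBOSE | re.MULTILINE)

# Two of the log lines from the question (backslashes omitted).
sample = """2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1
2023-05-28 / 02:24:29 User account locked out multiple login errors high SRVDC2$ john.smith.admin SRVDC2 NECBROWSER 1"""

# Strip the optional trailing period so both alert styles look alike.
alerts = [(m['alert'].rstrip('.'), m['severity']) for m in LOG_LINE.finditer(sample)]
print(alerts)
```

Named groups keep the calling code readable if more fields (username, endpoint, domain) are added to the pattern later.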
