Lookahead captures unwanted characters
Question
I'm trying to capture alert names that come from the firewall. Each log has the following format:
datetime alertname severity_level username endpoint_name domain
The current RegEx I'm using works for all logs except for the third one. Any ideas on how to fix it?
import re

regex = []
text = """2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1
2023-05-27 / 23:28:08 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVXAP02$ SRVDC2 ACME 1
2023-05-28 / 02:24:29 User account locked out multiple login errors high SRVDC2$ john.smith.admin SRVDC2 \\\\NECBROWSER 1
2023-05-28 / 05:01:48 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVNPS01$ SRVDC1 ACME 1
2023-05-28 / 06:38:57 Computer account added/changed/deleted. medium ANONYMOUS LOGON VD-OPERATOR1$ SRVDC1 ACME 1"""
pattern = r'(?:(?<=\d{2}:\d{2}:\d{2}))(.*)(?=\.)|(?=medium )|(?=high )|(?=low )|(?=critical )'
regex.append(re.findall(pattern,text,re.MULTILINE))
print(regex)
Current Output
[[' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '', ' User account locked out multiple login errors high SRVDC2$ john.smith', ' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '']]
Expected Output
[[' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '', ' User account locked out multiple login errors', ' Computer account added/changed/deleted', '', ' Computer account added/changed/deleted', '']]
Answer 1
Score: 3
You could use
\d{2}:\d{2}:\d{2}\s+
(.*?)
\s(?:medium|high|low|critical)
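In contrast to your original attempt, this pattern consumes the timestamp directly instead of using a lookbehind (lookbehinds are "expensive"!), captures the alert name with a lazy quantifier, and stops at the first severity keyword, which sits in a non-capturing group. Your greedy (.*) only backtracks as far as the last literal dot on the line, and on the third log that dot is the one inside john.smith.admin. Just use the first capturing group.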
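To make the greedy-versus-lazy difference concrete, here is a minimal sketch run against the third log line only. The variable names (`line`, `greedy`, `lazy`) are illustrative, and the first pattern is just the leading alternative of the original attempt, not the whole expression.

```python
import re

# The third log line from the question (raw string, so the backslashes stay literal).
line = (r"2023-05-28 / 02:24:29 User account locked out multiple login errors "
        r"high SRVDC2$ john.smith.admin SRVDC2 \\NECBROWSER 1")

# Greedy: .* first runs to the end of the line, then backtracks until the next
# character is a literal dot, i.e. it stops at the LAST dot on the line.
greedy = re.search(r"(?<=\d{2}:\d{2}:\d{2})(.*)(?=\.)", line).group(1)

# Lazy: .*? grows one character at a time and stops as soon as a whitespace
# character followed by a severity keyword can match.
lazy = re.search(r"\d{2}:\d{2}:\d{2}\s+(.*?)\s(?:medium|high|low|critical)", line).group(1)

print(repr(greedy))  # ' User account locked out multiple login errors high SRVDC2$ john.smith'
print(repr(lazy))    # 'User account locked out multiple login errors'
```

The greedy result reproduces the unwanted entry from the question's current output, while the lazy one ends right before the severity level.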
In Python, the pattern can be applied like this:
import re
text = """2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1
2023-05-27 / 23:28:08 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVXAP02$ SRVDC2 ACME 1
2023-05-28 / 02:24:29 User account locked out multiple login errors high SRVDC2$ john.smith.admin SRVDC2 \\\\NECBROWSER 1
2023-05-28 / 05:01:48 Computer account added/changed/deleted. medium ANONYMOUS LOGON SRVNPS01$ SRVDC1 ACME 1
2023-05-28 / 06:38:57 Computer account added/changed/deleted. medium ANONYMOUS LOGON VD-OPERATOR1$ SRVDC1 ACME 1"""
pattern = re.compile(r'''
\d{2}:\d{2}:\d{2}\s+
(.*?)
\s(?:medium|high|low|critical)
''', re.VERBOSE)
messages = [match.group(1) for match in pattern.finditer(text)]
print(messages)
which would yield
['Computer account added/changed/deleted.', 'Computer account added/changed/deleted.', 'User account locked out multiple login errors', 'Computer account added/changed/deleted.', 'Computer account added/changed/deleted.']
See a demo on ideone.com.
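If the severity is needed as well, a slightly stricter variant might look like the sketch below. It is not part of the answer above: the line anchoring, the `\b` word boundary, and the group names `alert` and `severity` are all assumptions added for illustration, based only on the log format shown in the question.

```python
import re

# Hypothetical stricter variant (assumption, not from the original answer).
LOG_LINE = re.compile(
    r"""
    ^\d{4}-\d{2}-\d{2}\s/\s\d{2}:\d{2}:\d{2}\s+   # date, slash, time at the start of a line
    (?P<alert>.*?)                                # alert name, as short as possible
    \s+(?P<severity>medium|high|low|critical)\b   # severity, matched as a whole word
    """,
    re.VERBOSE | re.MULTILINE,
)

# Two of the log lines from the question.
sample = (
    "2023-05-27 / 23:06:31 Computer account added/changed/deleted. medium "
    "ANONYMOUS LOGON PC-CR5$ SRVDC2 ACME 1\n"
    r"2023-05-28 / 02:24:29 User account locked out multiple login errors high "
    r"SRVDC2$ john.smith.admin SRVDC2 \\NECBROWSER 1"
)

alerts = [(m["alert"], m["severity"]) for m in LOG_LINE.finditer(sample)]
print(alerts)
# [('Computer account added/changed/deleted.', 'medium'),
#  ('User account locked out multiple login errors', 'high')]
```

The word boundary keeps a keyword such as `low` from matching the start of a longer word, and `re.MULTILINE` lets `^` anchor each log line separately. Note that the trailing period of the alert name is kept, as in the answer's output.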