英文:
Is there no catastrophic backtracking in Go regex?
问题
在今天在regex101.com上测试这个正则表达式时,^([a-z0-9]+(-)*)*([a-z0-9])$
,当我在以下字符串上测试时,出现了“灾难性回溯”错误:
使用PHP风格:
aaaaaaaaaaa-aaaaT
使用Python风格:
aaaaaaaaaaa-aaaaT
使用ECMAScript风格,这个更长的字符串超时了,可能是“灾难性回溯”的迹象:
aaaaaaaaaaa-aaaaaaaaaaaaaaaaaT
使用Java 8风格,这个字符串超时了:
aaaaaaaaaaa-aaaaaaaaaaaaaT
但是使用Go风格,即使是更长的字符串也没有出现错误或超时事件。而是显示no match (0.0ms)
。
那么,在我使用Go时,我可以忽略这个错误/警告吗?
我也对这个问题的原因感兴趣,但上面是我关键的问题。
英文:
While testing this regex today on regex101.com,
^([a-z0-9]+(-)*)*([a-z0-9])$
I got "catastrophic backtracking" error when I tested it on this string:
with flavor PHP:
> aaaaaaaaaaa-aaaaT
with flavor Python:
> aaaaaaaaaaa-aaaaT
with flavor ECMAScript this longer string got a timeout that 'may be an indication of catastrophic backtracking'
> aaaaaaaaaaa-aaaaaaaaaaaaaaaaaT
with flavor Java 8 timeout with string
> aaaaaaaaaaa-aaaaaaaaaaaaaT
but flavor Go gave no error or timeout event with much longer such strings. Instead it shows no match (0.0ms)
So can I ignore that error/warning when my regex is being used in Go?
I am interested in the reason for this too, but above is my key question.
答案1
得分: 6
是的,在使用Go中的正则表达式时,可以安全地忽略“catastrophic backtracking”警告。
Go使用RE2算法进行正则表达式匹配,RE2不使用回溯,因此在Go中不会出现这个问题。
有关正则表达式匹配的替代实现的更多信息,请参考https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times。
Go(RE2)对于输入字符串长度和正则表达式字符串长度具有线性性能:O(mn)。
然而,其他使用回溯的语言/库可能会在正则表达式和输入字符串上具有指数级的运行时间,具体取决于正则表达式和输入字符串。
regex101.com显示运行正则表达式与输入字符串的步骤数,你可以看到随着字符串长度的增加,步骤数呈指数级增长,例如对于正则表达式(a*)*$
和字符串aaaaaaaaaaaaaaaaX
,以及regex101.com上的调试器可以逐步显示模式匹配的执行过程,因此你可以看到回溯必须处理指数级增加的备选项数。
@sln提供了一个替代我的原始正则表达式的方法,以消除指数级回溯。
将原始正则表达式简化为a
和X
,对于输入字符串aaaaaaaaaaaaaaaaZ
,
^(a+X*)*a$
需要大约300,000步(每增加一个a
翻倍),但是
^(aX*)*a$
只需要大约100步。
我不知道有没有一种通用的方法将一个易受攻击的正则表达式映射到一个安全的正则表达式,除非@sln愿意提供服务
原始正则表达式的目的是检查输入字符串是否只包含[a-z0-9]和-
,并以[a-z0-9]开头和结尾。
a
,a-b
,ab--c
,a-b--aa---bbb
,...
英文:
Yes, the catastrophic backtracking warning can be safely ignored when using the regex in Go.
Go uses the RE2 algorithm for regex, and RE2 does not use backtracking so the problem does not arise in Go.
https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times has more information about alternative implementations for regex matching.
Go (RE2) has linear performance against input string length and regex string length: O(mn).
However other languages / libs that do use backtracking can have exponential running time, depending on the regex and the input string.
regex101.com shows the number of steps to run a regex against an input string and you can see the number of steps increase exponentially as you increase the string length for a regex like (a*)*$
with a string like aaaaaaaaaaaaaaaaX
. And the debugger on regex101.com can show the pattern match execution one step at a time, so you can see how backtracking has to handle an exponentially increasing number of alternatives.
@sln provided an alternative to my original regex that removed the exponential backtracking.
Simplifying the before/after regex to a
and X
, for input string aaaaaaaaaaaaaaaaZ
^(a+X*)*a$
takes about 300,000 steps (doubling for each additional a
) but
^(aX*)*a$
takes about 100 steps
I don't know any general way to map a vulnerable regex to a safe regex - unless @sln cares to provide a service
The purpose of the original regex was to check that an input string contains only [a-z0-9] and -
while starting and ending with [a-z0-9].
a
, a-b
, ab--c
, a-b--aa---bbb
, ...
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论