英文:
Why is this regular expression so slow in Java?
问题
最近,SonarQube规则 (https://rules.sonarsource.com/java/RSPEC-4784) 提醒我注意到一些性能问题,这些问题可以用于对抗 Java 正则表达式实现的拒绝服务攻击。
事实上,以下的 Java 测试显示了错误的正则表达式可以有多慢:
import org.junit.Test;
public class RegexTest {
@Test
public void fastRegex1() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
}
@Test
public void fastRegex2() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
}
@Test
public void slowRegex() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
}
}
正如你所见,前两个测试很快,第三个测试在 Java 8 中非常慢。
然而,相同的数据和正则表达式在 Perl 或 Python 中并不慢,这让我想知道为什么在 Java 中评估这个正则表达式如此缓慢。
$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
real 0m0.004s
user 0m0.000s
sys 0m0.004s
$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab
real 0m0.018s
user 0m0.015s
sys 0m0.004s
是什么导致了数据中的额外匹配修饰符 +
或尾随字符 s
使得这个正则表达式如此缓慢,为什么这只发生在 Java 中?
英文:
I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.
Indeed, the following Java test shows how slow the wrong regular expression can be:
import org.junit.Test;
public class RegexTest {
@Test
public void fastRegex1() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
}
@Test
public void fastRegex2() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
}
@Test
public void slowRegex() {
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
}
}
As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)
The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.
$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
real 0m0.004s
user 0m0.000s
sys 0m0.004s
$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab
real 0m0.018s
user 0m0.015s
sys 0m0.004s
What is it about the extra matching modifier +
or trailing character s
in the data which makes this regular expression so slow, and why is it only specific to Java?
答案1
得分: 54
请注意:我对正则表达式的内部机制了解不多,以下只是猜测。我无法回答为什么Java受到影响,而其他语言不受影响(此外,当我运行它时,它比您在jshell 11中运行的12秒要快得多,因此可能只影响某些版本)。
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")
有许多种方式可以匹配许多个a
:
(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
等等。
对于输入字符串"aaaaaaaaaaaaaaaaaaaaaaaaaaaab"
,它将贪婪地一次性匹配所有这些a
,然后匹配b
,完成任务。
对于"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"
,当它到达字符串末尾并发现字符串不匹配(因为有个s
),它没有正确地识别到s
意味着它永远不会匹配。因此,它在经历了可能的匹配,如下所示:
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs
它认为:“哦,也许失败是因为我对a
的分组方式不对”,然后回头尝试所有其他a
的组合。
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs // 不匹配
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs // 不匹配
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs // 不匹配
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs // 不匹配
(aaaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs // 不匹配
(aaaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs // 不匹配
(aaaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs // 不匹配
...
有很多这样的组合(我认为对于28个a
,大约有2^27(即134,217,728)种组合,因为每个a
可以是前一个组的一部分,或者可以开始自己的组),所以这需要很长时间。
英文:
Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).
"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")
There are lots of ways that lots of a
s could match:
(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.
For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab"
, it will greedily match all of those a
s in a single pass, match the b
, job done.
For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"
, when it gets to the end and finds that the string doesn't match (because of the s
), it's not correctly recognizing that the s
means it can never match. So, having gone through and likely matched as
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs
it thinks "Oh, maybe it failed because of the way I grouped the a
s - and goes back and tries all the other combinations of the a
s.
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs // ...
...
There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 a
s, because each a
can either be part of the previous group, or start its own group), so it takes a long time.
答案2
得分: 20
我不太了解Perl,但Python版本与Java版本不等效。您正在使用search()
,但Java版本使用matches()
。Python中的等效方法将是fullmatch()
。
当我在Python(3.8.2)中使用search()
运行您的示例时,我得到了快速的结果,就像您一样。但当我使用fullmatch()
运行它时,执行时间很长(几秒钟)。您的Perl示例是否也没有进行完全匹配?
顺便说一下:如果您想尝试Java版本的search
,您可以使用以下代码:
Pattern.compile("a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();
语义可能有些细微差别,但对于这个目的来说应该足够接近。
英文:
I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search()
but the Java version is using matches()
. The equivalent method in Python would be fullmatch()
When I run your examples in Python (3.8.2) with search()
I get quick results as you do. When I run it with fullmatch()
I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?
BTW: if you want to try the Java version of search you would use:
Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();
There might be some slight difference in the semantics but it should be close enough for this purpose.
答案3
得分: 14
额外的 +
在字符串无法匹配时会导致大量的回溯(在简单的正则表达式实现中)。如果字符串可以匹配,答案在第一次尝试时就已知。这解释了为什么情况2很快,只有情况3很慢。
英文:
The extra +
causes a lot of backtracking (in a naive regexp implementation) when the string cannot be matched. If the string can be matched, the answer is known in the first try. This explains why the case 2 is fast and only case 3 is slow.
答案4
得分: 9
这个网站https://swtch.com/~rsc/regexp/regexp1.html提供了有关正则表达式实现技巧及其背后的理论的详细信息。我知道仅提供链接的回答并不好,但这值得阅读,它展示了一个使用更好的实现方式完成的正则表达式示例,耗时30微秒,而使用更为常见和明显的方式则需要60秒(慢2百万倍)。
它说:
“今天,正则表达式也成为了忽视良好理论导致糟糕程序的典型示例。与今天流行的工具使用的正则表达式实现相比,许多那些三十年前的Unix工具中使用的实现要慢得多。”
其他回答中提到额外的“+”会导致过多的回溯,这是正确的,但只有当你忽视了良好的理论时才会如此。
英文:
The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way.
It says
> "Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."
Other answers saying that the extra +
causes too much backtracking are correct, but only if you ignore the good theory.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论