英文:
Internal working of replaceAll() function in String class, Java
问题
以下是翻译好的内容:
我已经访问了多个网站,只是为了了解 String 类中使用的任何正则表达式函数(如 split() 和 replaceAll())的内部工作原理。
问题陈述在这里:https://www.hackerearth.com/practice/basic-programming/implementation/basics-of-implementation/practice-problems/algorithm/one-string-no-trouble-37037871/
我的代码:
String s = "abaaccasdraaaadsfd";
s = s.replaceAll("(.)\{1,}", "$17$1");
String[] s2 = s.split("7");
int len = 0;
for(String a : s2) {
if(a.length() > len) {
len = a.length();
}
}
System.out.println(len);
一般的在线代码:
String s = "abaaccasdraaaadsfd";
int count=0;
int max=0;
for(int i=1;i<str.length();i++){
char ch =str.charAt(i-1);
char ch1=str.charAt(i);
if(ch!=ch1){
count++;
if(max<count){
max=count;
}
}else{
count=0;
}
}
System.out.println(max+1);
我想要了解一下,如果正则表达式在内部以 O(n) 的方式运行,其中 n 是字符串的长度,那么我的代码(使用正则表达式)与通用的在线代码(使用 for 循环)在时间复杂度方面是类似的。
提前感谢。
英文:
I have visited multiple websites just to know the internal workings of any regex function used in the String class like split() and replaceAll().
Problem statement is available here: https://www.hackerearth.com/practice/basic-programming/implementation/basics-of-implementation/practice-problems/algorithm/one-string-no-trouble-37037871/
My code:
String s = "abaaccasdraaaadsfd";
s = s.replaceAll("(.)\{1,}", "$17$1");
String[] s2 = s.split("7");
int len = 0;
for(String a : s2) {
if(a.length() > len) {
len = a.length();
}
}
System.out.println(len);
General code online:
String s = "abaaccasdraaaadsfd";
int count=0;
int max=0;
for(int i=1;i<str.length();i++){
char ch =str.charAt(i-1);
char ch1=str.charAt(i);
if(ch!=ch1){
count++;
if(max<count){
max=count;
}
}else{
count=0;
}
}
System.out.println(max+1);
I want to understand if the regex internally operates on O(n) where n is the length of the string then my code (using regex) is similar to general online code (using for loop) with respect to time complexity.
Thanks in advance.
答案1
得分: 1
正则表达式对于如此简单的分析来说过于复杂。
有一种叫作Thompson/NFA正则表达式解析器的东西。这种正则表达式解析器具有O(n+m)的性能,其中n是正则表达式的长度,m是输入字符串的长度。然而,TNFA无法处理反向引用、各种向前/向后查看等问题。一旦你开始在正则表达式中使用这些“TNFA不合格”的特性,事实上就不可能从你的正则表达式引擎中挤出O(n+m)的性能。对此的证明相当简单。这个正则表达式:
/^1?$|^(11+?)\1+$/
将匹配长度为质数且仅由'1'符号组成的输入字符串。它将不会匹配其他内容。
这个任务(检查质数)无法在O(n)的时间内完成。因此,任何能够运行上述正则表达式的正则表达式解析器都不能是O(n+m),证毕。
现在,一个相关的问题是:如果输入的正则表达式仅使用基本特性,以便可以由Thompson/NFA风格的状态机处理,那么Java会使用它,并在其他情况下退回到简单的回溯实现吗?
答案似乎与你的问题无关,因为你在这里使用的是回溯。然而,如果我没记错的话,Java没有附带TNFA实现,它总是会使用回溯。然而,这并没有写入规范中,因此将来的某个版本有可能根据输入正则表达式中使用的特性智能地切换实现方式。当前(截至JDK14)的实现完全支持java.util.regexp.Pattern类的文档中所解释的全部功能集(其中包括TNFA引擎无法执行的功能,如回溯),这意味着某些正则表达式在Java的正则表达式引擎中所需的时间比Thompson/NFA引擎要长上多个数量级。
更多关于Thompson / NFA的信息。
re2j是Thompson/NFA的Java实现。如果您想在Java中获得保证的线性性能(O(n+m)
)正则表达式,请使用re2j。正如数学所示,re2j不支持反向引用和其他一些功能(请查看网站上的列表),因此它无法运行您的(.)\\1{1,}
表达式 - 这是因为从数学角度来看,在O(n)
时间内无法做到这一点。
英文:
Regexes are way too complex for such a simplistic analysis.
There is a thing called a Thompson/NFA regexp parser. Such a regexp parser has O(n+m) performance, where n is the length of the regexp, and m is the length of the input string. However, TNFAs cannot deal with backreferences, various lookaheads/lookbacks, and have other issues. Once you start using these 'TNFA disqualifying' features in regexp, it is in fact impossible to squeeze o(n+m) performance out of your regexp engine. The proof for this is fairly trivial. This regexp:
/^1?$|^(11+?)\1+$/
will match input strings whose length is a prime number and consists solely of '1' symbols. It will fail other things.
This job (check if prime number) cannot be done in O(n). Therefore, any regexp parser that can run the above regexp cannot be O(n+m), QED.
Now, a relevant question would be: If the input regexp uses solely the basic features, such that the regexp could be processed by a Thompson/NFA style state machine, does java use that, falling back to a simple backtracking implementation otherwise?
The answer seems irrelevant to your question because you ARE using backtracking here. However, if memory serves, java doesn't ship with a TNFA implementation and will always use a backtracker. This is, however, not written into the spec, so some future version could feasibly switch imlementations intelligently depending on the features used in the input regexp. The current (as of JDK14) implementation fully supports the entire featureset as explained in the javadoc of the java.util.regexp.Pattern class (which includes features that a TNFA engine cannot do, such as backtracking), and it does mean that certain regexes take many orders of magnitude longer in java's regex engine vs. a Thompson/NFA one.
More info about Thompson / NFA.
re2j is a java implementation of Thompson/NFA. Use this if you want guaranteed linear performance (O(n+m)
) regular expressions in java. As is mathematically dictates, re2j doesn't support backreferences and a few other things (see the site for the list), so it cannot run your (.)\\1{1,}
expression - that's because mathematically speaking it isn't possible to do that in O(n)
time.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论