Optimizing CPU Usage in Java Regex Matching

Question

I encountered a performance issue related to regex matching in my Java project, which resulted in high CPU usage. Despite my attempts at regex optimization, I'm still experiencing performance problems, particularly when multiple threads concurrently invoke the code.

For instance, when calling a method that performs regex matching concurrently 100 times, the CPU usage spikes to 90% for a brief period.

Here's a simplified example of my code:

String BUY_PATTERN = ".*\\b(purchase)\\b.*";

private static boolean isMatchPattern(String pattern, String text) {
    return text.matches(BUY_PATTERN);
}

I would like to reduce the CPU usage during regex matching. Can you provide suggestions for more efficient regex patterns that achieve the same functionality?

Additionally, I came across an article (link provided) discussing the impact of backtracking on performance, but I find it challenging to rewrite the regex to minimize backtracking.

Thank you for your assistance!

Answer 1

Score: 1

There are a few changes you can make.

You can greatly reduce the CPU consumption by creating re-usable objects.

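import java.util.regex.Pattern;

public class Example {
    // Compiled once and reused; Pattern is immutable and safe to share across threads.
    private static final Pattern PATTERN = Pattern.compile("\\bpurchase\\b");

    private static boolean isMatchPattern(String text) {
        // A Matcher is not thread-safe, so create one per call (cheap compared to compiling).
        return PATTERN.matcher(text).find();
    }
}

Behind the scenes, each call to String.matches compiles a new Pattern and creates a new Matcher.

To avoid this, compile the Pattern once and keep it in a field of your class, then use that field from within your isMatchPattern method.
Only the Pattern should be shared: a Matcher carries per-search state and is not safe for use by multiple threads, so create it inside the method.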
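Since the question is about many threads calling the method at once, here is a minimal sketch of how a single compiled Pattern could be exercised concurrently. The class name SharedPatternDemo, the method names, and the thread-pool setup are my own assumptions for illustration, not part of the original code. It relies on the documented behaviour that Pattern instances can be shared between threads while each Matcher should stay local to one call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class SharedPatternDemo {
    // Compiled once; Pattern is immutable and may be shared between threads.
    private static final Pattern BUY = Pattern.compile("\\bpurchase\\b");

    static boolean isMatch(String text) {
        // Each call gets its own Matcher, so no state is shared between threads.
        return BUY.matcher(text).find();
    }

    // Runs isMatch for every text on a fixed-size pool and counts the hits.
    static int countMatchesConcurrently(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String text : texts) {
                tasks.add(() -> isMatch(text));
            }
            int count = 0;
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (result.get()) {
                    count++;
                }
            }
            return count;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this lowers the CPU spike enough depends on the workload; a short burst of high CPU while 100 matches run in parallel is not necessarily a problem in itself.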

Furthermore, for the regular expression pattern, there is no need to capture the text "purchase", so you can remove the parentheses.

Additionally, Matcher.find looks for the pattern anywhere within the text, whereas a String.matches call requires the entire string to match.
So, with find you don't need the leading and trailing .*; they are redundant, and dropping them also removes the backtracking that the greedy .* causes.
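To see how the two forms compare, here is a small side-by-side sketch. The class and method names are made up for the example. One difference worth knowing: by default . does not match line terminators, so the .*-wrapped pattern used with String.matches rejects text that contains a newline, while find on the shorter pattern still locates the word.

```java
import java.util.regex.Pattern;

// Hypothetical comparison class; names are illustrative only.
class FindVersusMatches {
    // The pattern from the question, used with String.matches.
    static final String ORIGINAL = ".*\\b(purchase)\\b.*";
    // The simplified pattern, compiled once and used with Matcher.find.
    static final Pattern SIMPLIFIED = Pattern.compile("\\bpurchase\\b");

    static boolean originalMatches(String text) {
        return text.matches(ORIGINAL);
    }

    static boolean simplifiedFinds(String text) {
        return SIMPLIFIED.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "I want to purchase this",   // both report a match
            "purchases only",            // neither matches: no boundary after "purchase"
            "first line\npurchase now"   // only find matches: '.' stops at the newline
        };
        for (String sample : samples) {
            System.out.println(originalMatches(sample) + " " + simplifiedFinds(sample));
        }
    }
}
```

If your input can span several lines, check which of the two behaviours you actually want before switching.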

As for using String.indexOf or String.contains: if you require the word-boundary check, these are not a natural fit, as you'd have to make more than one call (one to locate the substring, then further checks on the characters around it).

If you don't require the check, then this would be the way to go.

As a last resort, you can write your own loop over the characters of the text, which is more or less what the Matcher class does.
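For what it's worth, the last two suggestions could be combined roughly like this: locate candidates with String.indexOf, then inspect the neighbouring characters by hand. This is only a sketch under my own assumptions. The names WordSearch, containsWord and isWordChar are hypothetical, and isWordChar only approximates \b (letters, digits and underscore count as word characters); the exact \b rules for non-ASCII text can differ between Java versions. It also assumes the search word itself starts and ends with a word character, as "purchase" does.

```java
// Hypothetical helper; names are illustrative only.
class WordSearch {
    // Approximation of a regex "word character": letter, digit or underscore.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Returns true if 'word' occurs in 'text' with no word character directly
    // before or after it.
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(word, from);
            if (start < 0) {
                return false; // no further occurrences
            }
            int end = start + word.length();
            boolean leftOk = start == 0 || !isWordChar(text.charAt(start - 1));
            boolean rightOk = end == text.length() || !isWordChar(text.charAt(end));
            if (leftOk && rightOk) {
                return true;
            }
            from = start + 1; // keep looking after this rejected occurrence
        }
    }
}
```

For example, containsWord("I want to purchase this", "purchase") is true, while containsWord("repurchase", "purchase") is false. I'd measure before adopting this; the precompiled Pattern is simpler and may well be fast enough.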

huangapple
  • This article was posted on May 17, 2023 at 16:51:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76270212.html