May 14, 2023 18:47:02

Performance-wise, is it better to use split or a matching regex to extract a substring from a string?

Question

I have a string like this:

/good/312321312/bad/3213122131

I need to extract the two sets of digits from it.
I thought of two solutions: either using split() or writing a regex to match the digits.
Which solution would perform better?
If you have any other suggestions, please let me know.

Answer 1

Score: 2

Since creating a new string implies copying all of its characters, the implicit `substring` operations of `split` are the most expensive aspect here. Creating an array to hold all the strings adds to the cost, but that is minuscule compared to the string creations. Still, we can avoid both.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractNumbers {
    // Compiled once and reused across calls
    static final Pattern NUMBER = Pattern.compile("\\d+");

    public static void main(String[] args) {
        String s = "/good/312321312/bad/3213122131";

        long first = -1, second = -1;
        Matcher m = NUMBER.matcher(s);
        if (m.find()) {
            // Parse the digits in place, without creating a substring (Java 9+)
            first = Long.parseLong(s, m.start(), m.end(), 10);
            if (m.find()) {
                second = Long.parseLong(s, m.start(), m.end(), 10);
            }
        }

        System.out.println(first + "\t" + second);
    }
}
```

Or, to collect every match into an array:

```java
// Reuses NUMBER from above; additionally requires
// import java.util.Arrays; and import java.util.stream.LongStream;
public static void main(String[] args) {
    String s = "/good/312321312/bad/3213122131";

    LongStream.Builder b = LongStream.builder();
    Matcher m = NUMBER.matcher(s);
    while (m.find()) {
        b.add(Long.parseLong(s, m.start(), m.end(), 10));
    }

    long[] result = b.build().toArray();

    System.out.println(Arrays.toString(result));
}
```

When performance matters, it’s important to keep and reuse compiled `Pattern` instances instead of using convenience methods like `String.split`, which throw away the `Pattern` instance after the operation.

Obviously, this only matters if the code is executed more than once. But when the code is executed only once, its performance wouldn’t matter anyway.
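If you still prefer a split-based approach, the reuse advice can be sketched roughly as follows. This is only an illustration: the class name `PrecompiledSplit` and the helper `segments` are made up for this example and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern SLASH = Pattern.compile("/");

    // Hypothetical helper: splits a path into its segments using the shared Pattern.
    static String[] segments(String path) {
        return SLASH.split(path);
    }

    public static void main(String[] args) {
        String[] parts = segments("/good/312321312/bad/3213122131");
        // The leading '/' yields an empty first segment, so the numbers
        // end up at indices 2 and 4.
        System.out.println(parts[2] + "\t" + parts[4]);
    }
}
```

This still creates a string for every segment, including the non-numeric ones, so it addresses only the cost of pattern compilation and not the cost of the substrings.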

The `Long.parseLong(CharSequence, int, int, int)` overload, which lets you skip the substring operation, has existed since Java 9. But even if you use `Long.parseLong(m.group())` here, you avoid creating strings for the non-numeric parts and keep the temporary strings as short as possible, which is optimizer-friendly.
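For runtimes older than Java 9, a minimal sketch of the `m.group()` variant might look like this. The class name `GroupFallback` and the helper `extract` are assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupFallback {
    // Compiled once and reused across calls
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Hypothetical helper: collects every run of digits in s as a long.
    static List<Long> extract(String s) {
        List<Long> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            // m.group() creates a short temporary string holding only the digits
            result.add(Long.parseLong(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [312321312, 3213122131]
        System.out.println(extract("/good/312321312/bad/3213122131"));
    }
}
```

Unlike the array-based variants above, this one boxes each value into a `Long`, which is a trade-off made here only to keep the sketch short.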

Answer 2

Score: 1

A split-based approach may well be the more efficient one. We can split the input into components and then check whether each one is a long integer.

    // Requires: import java.util.ArrayList; import java.util.List;
    String path = "/good/312321312/bad/3213122131";
    String[] parts = path.split("/");
    List<Long> nums = new ArrayList<>();
    for (String part : parts) {
        try {
            long num = Long.parseLong(part);
            nums.add(num);
        }
        catch (NumberFormatException nfe) {
            // Not a number (for example "", "good" or "bad"); skip this part
        }
    }

    System.out.println("Found nums: " + nums);

This prints:

    Found nums: [312321312, 3213122131]

A solution that uses only basic string functions may well outperform one that invokes the regex engine.
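To illustrate what "only basic string functions" could mean when taken one step further, here is a rough sketch that uses neither `split` nor a regex and walks the separators with `indexOf` instead. The class name `IndexScan` and its helpers are hypothetical, and the sketch assumes Java 9+ for the range-based `Long.parseLong` overload. Unlike the code above, it accepts only unsigned all-digit segments, and it has not been benchmarked against the other approaches.

```java
import java.util.Arrays;

public class IndexScan {
    // Hypothetical helper: returns every all-digit segment between '/' separators.
    static long[] extract(String path) {
        long[] buf = new long[4];
        int count = 0;
        int start = 0;
        while (start <= path.length()) {
            int end = path.indexOf('/', start);
            if (end < 0) {
                end = path.length(); // last segment runs to the end of the string
            }
            if (end > start && isDigits(path, start, end)) {
                if (count == buf.length) {
                    buf = Arrays.copyOf(buf, count * 2);
                }
                // Parses in place without creating a substring (Java 9+).
                // Still throws NumberFormatException if the value overflows a long.
                buf[count++] = Long.parseLong(path, start, end, 10);
            }
            start = end + 1;
        }
        return Arrays.copyOf(buf, count);
    }

    // True if every character in [from, to) is an ASCII digit.
    private static boolean isDigits(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // prints [312321312, 3213122131]
        System.out.println(Arrays.toString(extract("/good/312321312/bad/3213122131")));
    }
}
```

Checking the characters up front avoids using exceptions for control flow on the non-numeric segments, which is a design choice of this sketch rather than something the answer prescribes.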
