Performance-wise, is it better to use split or a matching regex to extract subtext from a string?


Question

I have a string like this:

/good/312321312/bad/3213122131

I have to extract the two sets of digits from it.
I considered two solutions: either using `split()` or writing a regex to match the digits.
Which solution would be better performance-wise?
If you have any other suggestions, please let me know.

Answer 1

Score: 2

Since creating a new string implies copying all of its characters, the implicit `substring` operations of `split` are the most expensive part here. Creating an array to hold all the strings adds to the cost, but that is minuscule compared to creating the strings. Still, we can avoid both.

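```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractTwoNumbers {
    // Compiled once and reused for every match operation
    static final Pattern NUMBER = Pattern.compile("\\d+");

    public static void main(String[] args) {
        String s = "/good/312321312/bad/3213122131";

        long first = -1, second = -1; // -1 means "not found"
        Matcher m = NUMBER.matcher(s);
        if (m.find()) {
            // Parse directly from the character range; no substring is created (Java 9+)
            first = Long.parseLong(s, m.start(), m.end(), 10);
            if (m.find()) {
                second = Long.parseLong(s, m.start(), m.end(), 10);
            }
        }

        System.out.println(first + "\t" + second);
    }
}
```

or

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.LongStream;

public class ExtractAllNumbers {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    public static void main(String[] args) {
        String s = "/good/312321312/bad/3213122131";

        // Collect every match without knowing the count in advance
        LongStream.Builder b = LongStream.builder();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) b.add(Long.parseLong(s, m.start(), m.end(), 10));

        long[] result = b.build().toArray();

        System.out.println(Arrays.toString(result));
    }
}
```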

When performance matters, it’s important to keep and reuse compiled `Pattern` instances instead of using convenience methods like `String.split`, which throw away the `Pattern` instance after the operation.

Obviously, this only matters if the code is executed more than once. But when the code is executed only once, its performance wouldn’t matter anyway.
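As a minimal sketch of the reuse point above, the pattern can live in a static field so that repeated calls only create a `Matcher`. The class name `NumberExtractor` and the method `extract` are hypothetical names chosen for illustration; they are not part of the original answer.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class NumberExtractor {
    // Compiled once when the class is initialized and shared by all calls.
    // Pattern instances are immutable and safe to share between threads.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Returns every run of digits in s, parsed as a long.
    static long[] extract(String s) {
        Matcher m = NUMBER.matcher(s); // a Matcher is created per call and is not thread-safe
        long[] found = new long[4];
        int count = 0;
        while (m.find()) {
            if (count == found.length) {
                found = Arrays.copyOf(found, count * 2); // grow the buffer when full
            }
            // Java 9+ overload that parses straight from the character range
            found[count++] = Long.parseLong(s, m.start(), m.end(), 10);
        }
        return Arrays.copyOf(found, count); // trim to the number of matches
    }

    public static void main(String[] args) {
        String[] inputs = {"/good/312321312/bad/3213122131", "/good/1/bad/2"};
        for (String input : inputs) {
            System.out.println(Arrays.toString(extract(input)));
        }
    }
}
```

Calling `extract` many times compiles the regex only once; whether that is measurably faster in a given application would still need to be confirmed with a proper benchmark.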

The `Long.parseLong(CharSequence, int, int, int)` overload, which lets you skip the substring operation, has existed since Java 9. But even if you use `Long.parseLong(m.group())` here, you avoid creating strings for the non-numerical parts and keep the temporary strings as short as possible, which is optimizer-friendly.
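For runtimes older than Java 9, the `m.group()` fallback mentioned above might look like the following sketch. The class name `GroupFallback` and the method `extract` are hypothetical, and the result is boxed into a `List<Long>` purely for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java 8 compatible variant; the names are illustrative only.
class GroupFallback {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<Long> extract(String s) {
        List<Long> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            // m.group() creates one short temporary string per number;
            // the non-numeric parts ("good", "bad") are never turned into strings.
            result.add(Long.parseLong(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("/good/312321312/bad/3213122131"));
    }
}
```

Note that in both variants a run of digits too large for a `long` makes `Long.parseLong` throw a `NumberFormatException`, so real input may need a guard.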

Answer 2

Score: 1

Splitting is typically likely to be the more efficient approach. We can split the input into components and then check each one to see whether it is a long integer.

<!-- language: java -->

    import java.util.ArrayList;
    import java.util.List;

    public class SplitNumbers {
        public static void main(String[] args) {
            String path = "/good/312321312/bad/3213122131";
            String[] parts = path.split("/");
            List<Long> nums = new ArrayList<>();
            for (String part : parts) {
                try {
                    long num = Long.parseLong(part);
                    nums.add(num);
                }
                catch (NumberFormatException nfe) {
                    // not a number (e.g. "", "good", "bad"): skip this component
                }
            }

            System.out.println("Found nums: " + nums);
        }
    }

This prints:

    Found nums: [312321312, 3213122131]

Any solution that uses only basic string functions may well outperform one that pays the cost of invoking a regex engine.
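To illustrate what a solution using only basic string functions could look like, here is a sketch that scans the characters by hand, using neither `split` nor a regex. The class name `ManualScan` and its methods are hypothetical, the range-based `Long.parseLong` overload assumes Java 9 or later, and whether this is actually faster than the other approaches is an assumption that would need to be measured.

```java
import java.util.Arrays;

// Hypothetical regex-free variant; the names are illustrative only.
class ManualScan {
    // Returns every run of ASCII digits in s, parsed as a long.
    static long[] extract(String s) {
        long[] found = new long[4];
        int count = 0;
        int i = 0;
        final int n = s.length();
        while (i < n) {
            if (!isDigit(s.charAt(i))) {
                i++; // skip non-digit characters such as '/' and letters
                continue;
            }
            int start = i;
            while (i < n && isDigit(s.charAt(i))) {
                i++; // advance to the end of the digit run
            }
            if (count == found.length) {
                found = Arrays.copyOf(found, count * 2); // grow the buffer when full
            }
            // Parses the range [start, i) without creating a substring (Java 9+)
            found[count++] = Long.parseLong(s, start, i, 10);
        }
        return Arrays.copyOf(found, count);
    }

    // ASCII digits only, which is what \d matches in Java by default
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("/good/312321312/bad/3213122131")));
    }
}
```

Unlike the `split` version, this picks up digit runs anywhere in the string (for example the `12` in `a12b`), not just components that are numbers as a whole. It is also worth knowing that, in OpenJDK, `String.split` with a single-character literal delimiter such as `"/"` takes a fast path that avoids the regex engine, so the gap between the approaches may be smaller than expected.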

huangapple
  • Posted on May 14, 2023, 18:47:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76247049.html