Removing all occurrences of the specified substring, even overlapping ones

Question

For example, the source string is "appleappleapplebanana" and the pattern I want to delete is "appleapple".

I want to delete every occurrence of "appleapple", even overlapping ones, so that only "banana" is left.

appleappleapplebanana
^^^^^^^^^^              <-first  occurrence
     ^^^^^^^^^^         <-second occurrence     

If I use replaceAll, the result is "applebanana": once the first occurrence is deleted, the remaining part is just "applebanana", which no longer contains the pattern.

Expected results:

| Input | Pattern | Result |
|---|---|---|
| "appleapplebanana" | "appleapple" | "banana" |
| "appleappleapplebanana" | "appleapple" | "banana" |
| "applebanana" | "appleapple" | "applebanana" |
| "aaabbbaaabbbaaa" | "aaabbbaaa" | "" (empty string) |

I need to process arbitrary input patterns, so just using replace("apple") wouldn't work.

Though I have an idea for this:

  1. Get all occurrences (using something like KMP)
  2. Mark the corresponding characters as "to be deleted"
  3. Delete marked characters

However, I would like to know if there is a better (ready-made) way to achieve this.

I ended up writing my own function based on the idea above, since no common library or package seems to support this feature.
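
The poster's own function is not shown, so the following is only a minimal sketch of the mark-and-delete idea. The class and method names are hypothetical, and it uses a plain `String.indexOf` scan instead of KMP, which is simpler but can be slower on adversarial inputs.

```java
import java.util.BitSet;

// Hypothetical helper; not the original poster's code.
final class OverlapRemover {

    /** Removes every character covered by at least one occurrence of pattern. */
    static String removeAllOverlapping(String source, String pattern) {
        int n = pattern.length();
        if (n == 0) {
            return source; // assumption: an empty pattern removes nothing
        }
        // Steps 1 and 2: find each occurrence and mark the characters it covers.
        BitSet marked = new BitSet(source.length());
        int from = source.indexOf(pattern);
        while (from >= 0) {
            marked.set(from, from + n); // end index is exclusive
            // Advance by one position only, so overlapping occurrences are found too.
            from = source.indexOf(pattern, from + 1);
        }
        // Step 3: copy only the unmarked characters.
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            if (!marked.get(i)) {
                out.append(source.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "banana"
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple"));
    }
}
```

Because the pattern is treated as a literal string, this sketch needs no regex escaping.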

Answer 1

Score: 2

The question was a bit confusing at first. After the updates, I think the best example provided to illustrate the problem is matching the "pattern" aaabbbaaa in aaabbbaaabbbaaa.

aaabbbaaabbbaaa
aaabbbaaa
      aaabbbaaa
      ^-^        < overlapping part
^-------------^  < match this part: 'aaa' is overlapping

If the length of the "pattern" string may be used in the regex, a lookbehind could be used:

.{1,9}(?<=aaabbbaaa)

This regex (demo) matches between one and nine characters (nine being the pattern's length), provided the match ends directly after an occurrence of aaabbbaaa. It therefore matches aaabbbaaa itself and also the following bbbaaa, because the final a is likewise preceded by aaabbbaaa. Thanks to the length restriction, it never consumes characters outside an occurrence. It also matches non-overlapping occurrences, as in aaabbbaaaaaabbbaaa, but leaves e.g. ccc in aaabbbaaacccaaabbbaaa.

A Java demo at tio.run that incorporates the length:

String regex = ".{1," + pat.length() + "}(?<=" + pat + ")";
Pattern p = Pattern.compile(regex);
String result = p.matcher(str).replaceAll("");
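
The snippet above concatenates `pat` into the regex unescaped, so a pattern containing regex metacharacters (for example a dot) would be interpreted rather than matched literally. The following is a self-contained sketch of the same lookbehind idea that quotes the pattern with `Pattern.quote`. The class and method names are hypothetical, and the `DOTALL` flag is an added assumption so that the dot can also consume line breaks.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookbehind regex from the answer.
final class LookbehindRemover {

    static String removeAllOverlapping(String source, String pattern) {
        if (pattern.isEmpty()) {
            return source; // ".{1,0}" would be an invalid quantifier
        }
        // Quote the pattern so it is matched literally inside the lookbehind.
        String quoted = Pattern.quote(pattern);
        String regex = ".{1," + pattern.length() + "}(?<=" + quoted + ")";
        return Pattern.compile(regex, Pattern.DOTALL).matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "banana"
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple"));
        // prints "ccc"
        System.out.println(removeAllOverlapping("aaabbbaaacccaaabbbaaa", "aaabbbaaa"));
    }
}
```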

Update, incorporating parts of @markalex's idea: for better performance, especially with longer inputs, first match the "pattern" once and then wrap the lookbehind part in a repeated group (regex101 demo).

aaabbbaaa(?:.{1,9}(?<=aaabbbaaa))*

This also yields a single match for adjacent parts, which may be desirable anyway. Furthermore, you can use \w (word character) instead of the dot if the input contains non-word characters.
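
As a rough illustration, the repeated-group variant could be built for an arbitrary pattern as follows. This is a sketch rather than the answerer's code: the class and method names are hypothetical, and `Pattern.quote` and `DOTALL` are added assumptions for literal matching and line breaks.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the repeated-group regex.
final class RepeatedGroupRemover {

    static String removeAllOverlapping(String source, String pattern) {
        if (pattern.isEmpty()) {
            return source; // avoid building an invalid ".{1,0}" quantifier
        }
        String quoted = Pattern.quote(pattern);
        // Match the pattern once, then keep extending the match while each
        // extension of at most pattern.length() characters ends on another occurrence.
        String regex = quoted + "(?:.{1," + pattern.length() + "}(?<=" + quoted + "))*";
        return Pattern.compile(regex, Pattern.DOTALL).matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "banana"
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple"));
    }
}
```

Each run of overlapping or adjacent occurrences is consumed as a single match here, so the lookbehind is only attempted after a literal occurrence has been found.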

Answer 2

Score: 0

Technically, this is over-lapping.

appleapple
appleappleappleapple
appleapple


And, this is repeating.

```none
appleapple
     appleapple
          appleapple
```

You could, however, describe the latter as having overlapped.
That is not intrinsically a property of a pattern that is considered to be repeating.
At that point it would be inherent (redundant); it's just a description.

In addition to String#replace there is also String#replaceAll.
It uses a regular expression pattern as the first argument.

You could use the following pattern to replace repeating values that have over-lapped.

(apple)+
replaceAll("(apple)\\1+", "")
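
As a quick check of what the back-reference pattern does, here is a small sketch with hypothetical class and method names. It only applies when the pattern to delete is a single unit repeated (here "apple"), so it cannot express a case such as "aaabbbaaa".

```java
// Hypothetical demo of the back-reference approach for the "apple" unit only.
final class BackreferenceDemo {

    static String removeRepeatedApple(String source) {
        // (apple) captures one unit; \1+ requires at least one further copy,
        // so a lone "apple" is left untouched.
        return source.replaceAll("(apple)\\1+", "");
    }

    public static void main(String[] args) {
        // prints "banana"
        System.out.println(removeRepeatedApple("appleappleapplebanana"));
        // prints "applebanana"
        System.out.println(removeRepeatedApple("applebanana"));
    }
}
```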

I'm not sure if there is a way to remove over-lapping values using a single pattern.
I imagine it would be much more complex.

You mentioned "... mark corresponding characters as 'to-be deleted'".
This would most likely be the logical way to remove truly over-lapping values.

huangapple
  • Published on June 1, 2023, at 22:58:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76383244.html