如何在Go语言中找到最长匹配的子字符串

huangapple go评论89阅读模式
英文:

How to find the longest matching substring in Go

问题

找到最长子字符串匹配的正确方法是什么?
我一直在尝试使用Go语言的regexp包来找到最长的子字符串匹配:https://pkg.go.dev/regexp

具体来说,我想找到字符串中最长的连续元音字母匹配。可以是任意长度或组合的 "aeiou"。

以下是我目前的代码。它在字符串长度过长时会导致恐慌(超过1000)。我从最长的子字符串开始搜索,并在找到匹配后返回。

s := "chrononhotonthuooaos"
for i := utf8.RuneCountInString(s); i >= 0; i-- {
    exp := "[aeiou]{"
    exp += fmt.Sprint(i)
    exp += "}"
    regexStr := regexp.MustCompile(exp)
    longestStr := regexStr.FindStringSubmatch(s)
    if longestStr != nil {
        fmt.Println("longest submatch found:", longestStr)
        return utf8.RuneCountInString(longestStr[0])
    }
}
return 0
英文:

What's the correct way to find the longest substring match?
I’ve been trying to find the longest substring match with the regexp package in Go: https://pkg.go.dev/regexp

Specifically the longest consecutive match of vowels in a string. Any length or combination of "aeiou".

Here's the code I have now. It works until the strings get too long resulting in a panic (over 1000). I start searching for the longest substring and return once a match is found.

s := "chrononhotonthuooaos"
for i := utf8.RuneCountInString(s); i >=0; i-- {
    exp := "[aeiou]{"
    exp += fmt.Sprint(i)
    exp += "}"
    regexStr := regexp.MustCompile(exp)
    longestStr := regexStr.FindStringSubmatch(s)
    if longestStr != nil {
        fmt.Println("longest submatch found:", longestStr)
        return utf8.RuneCountInString(longestStr[0])
    }
}
return 0

答案1

得分: 1

你之所以出现恐慌是因为你在每次迭代中都创建了一个新的正则表达式 - 一个长度为10,000的字符串?这意味着你创建了10,000个编译后的正则表达式,我相信这些会被缓存起来。更不用说正则表达式的编译是很耗费资源的。

那么为什么要使用正则表达式呢?像这样的方法几乎肯定更快,不会创建任何中间对象,并且以O(n)的时间复杂度和O(1)的空间复杂度运行。它适用于任何长度的字符串:

func longest_vowel_run(s string) (string, int, int) {
    slen := len(s)
    bgn := 0
    end := 0
    max := 0
    x := 0
    y := 0

    // while we still have chars left...
    for x < len(s) {

        // find the first vowel
        for x < slen && !vowel[s[x]] {
            x++
        }
        // find the next non-vowel
        for y = x; y < slen && vowel[s[y]]; {
            y++
        }

        // if the current run is longer than the max run, update the max run
        if z := y - x; z > max {
            bgn = x
            end = y
            max = z
        }

        // pick up where we left off
        x = y

    }

    var maxRun string
    if max > 0 {
        maxRun = s[bgn:end]
    }

    return maxRun, bgn, end - 1
}

var vowel = map[byte]bool{
    'a': true, 'A': true,
    'e': true, 'E': true,
    'i': true, 'I': true,
    'o': true, 'O': true,
    'u': true, 'U': true,
}

你可以在这里查看代码示例:https://goplay.tools/snippet/nvZwWeEF8bC

英文:

The reason you are getting panics is because your are creating a new regex on every iteration — a 10,000 character string? That's 10,000 compiled regexen, and I'm pretty sure those get cached. Not to mention that regex compilation is expensive.

So why use a regex at all? Something like this is almost certainly faster, doesn't create any intermediate objects, and runs in O(n) time and O(1) space complexity. It will work for strings of any length whatsoever:

https://goplay.tools/snippet/nvZwWeEF8bC

func longest_vowel_run(s string) (string, int, int) {
	slen := len(s)
	bgn := 0
	end := 0
	max := 0
	x := 0
	y := 0

	// while we still have chars left...
	for x &lt; len(s) {

		// find the first vowel
		for x &lt; slen &amp;&amp; !vowel
展开收缩
] { x++ } // find the next non-vowel for y = x; y &lt; slen &amp;&amp; vowel
展开收缩
]; { y++ } // if the current run is longer than the max run, update the max run if z := y - x; z &gt; max { bgn = x end = y max = z } // pick up where we left off x = y } var maxRun string if max &gt; 0 { maxRun = s[bgn:end] } return maxRun, bgn, end - 1 } var vowel = map[byte]bool{ &#39;a&#39;: true, &#39;A&#39;: true, &#39;e&#39;: true, &#39;E&#39;: true, &#39;i&#39;: true, &#39;I&#39;: true, &#39;o&#39;: true, &#39;O&#39;: true, &#39;u&#39;: true, &#39;U&#39;: true, }

答案2

得分: 0

你可以使用以下代码:

re := regexp.MustCompile(`[aeiou]+`)

matches := re.FindAllStringSubmatch(s, -1)
if len(matches) > 0 {
	longest := 0
	for i := range matches {
		if len(matches[i]) >= len(matches[longest]) {
			longest = i
		}
	}
	fmt.Println(matches[longest])
}

或者,不使用正则表达式,你可以使用以下代码:

s := "chrononhotonthuooaos"

var start, end int
for i := 0; i < len(s); i++ {
	switch s[i] {
	case 'a', 'e', 'i', 'o', 'u':
		j := i + 1
		for ; j < len(s); j++ {
			switch s[j] {
			case 'a', 'e', 'i', 'o', 'u':
				continue
			}
			break
		}
		if j-i > end-start {
			start, end = i, j
		}
		i = j
	}
}

fmt.Println(s[start:end])

你可以在以下链接中查看代码的运行结果:

https://go.dev/play/p/loxIsR3LMOM

https://go.dev/play/p/XT5-pr3n0tl

英文:

You can do the following:

re := regexp.MustCompile(`[aeiou]+`)

matches := re.FindAllStringSubmatch(s, -1)
if len(matches) &gt; 0 {
	longest := 0
	for i := range matches {
		if len(matches[i]) &gt;= len(matches[longest]) {
			longest = i
		}
	}
	fmt.Println(matches[longest])
}

https://go.dev/play/p/loxIsR3LMOM


Or, without regexp, you can do:

s := &quot;chrononhotonthuooaos&quot;

var start, end int
for i := 0; i &lt; len(s); i++ {
	switch s[i] {
	case &#39;a&#39;, &#39;e&#39;, &#39;i&#39;, &#39;o&#39;, &#39;u&#39;:
		j := i + 1
		for ; j &lt; len(s); j++ {
			switch s[j] {
			case &#39;a&#39;, &#39;e&#39;, &#39;i&#39;, &#39;o&#39;, &#39;u&#39;:
				continue
			}
			break
		}
		if j-i &gt; end-start {
			start, end = i, j
		}
		i = j
	}
}

fmt.Println(s[start:end])

https://go.dev/play/p/XT5-pr3n0tl

答案3

得分: 0

这是我的答案。虽然它不像其他人那样优雅,但对我来说有效。https://go.dev/play/p/ZpZtqt92lvc

package main

import (
	"fmt"
	"regexp"
)

func main() {

	word := "chrononhotonthuooaos"
	//word := "aeoihddkdhuaiahdnaanbcvuuooaaiieendjahaaaiiie"

	// 只允许元音字母(一个或多个)
	re := regexp.MustCompile(`[aeiou]+`)

	// 存储最长的元音字母片段
	max_length := 0

	// 最长片段在`vowels_matches`中的位置
	position := 0

	// 类型 ~~~~> [][]byte{} 所有在另一个中的重叠片段
	vowels_matches := re.FindAll([]byte(word), -1)

	for i := 0; i < len(vowels_matches); i++ {

		// 寻找最长的元音字母片段
		// 比较它们的长度

		if len(vowels_matches[i]) > max_length {

			// 如果找到比max_length更长的片段,max_length将取其值
			max_length = len(vowels_matches[i])

			// 我们还想知道位置,在那之后具体选择该片段
			position = i
		}
	}

	// 获取该字节片段(最长的)并转换为字符串
	result := string(vowels_matches[position])

	fmt.Printf("%v 类型 ~~~> %T\n", result, result)

}

另一种选项是不使用正则表达式,并借助一个小的递归函数来实现:https://go.dev/play/p/7sssSWv-0UL

package main

import (
	"fmt"
)

func main() {

	word := "chrononhotonthuooaos"

	pattern := "aeiou"

	// 我们只想要`word`变量中的所有元音字母
	// 所以`vowels`将附加到`word`中的位置
	vowels := []int{}

	// 元音字母匹配的片段的片段切片
	vowels_matches := [][]int{}

	// 存储元音字母匹配片段的最大长度
	max_length := 0

	// 最长片段在`vowels_matches`中的位置
	position := 0

	final_string_result := ""

	// 从`word`中附加元音字母
	for i, vowel := range word {
		for _, v := range pattern {
			if v == vowel {
				vowels = append(vowels, i) // 附加该元音字母的位置
			}
		}
	}

	vowels_matches = recursive(int(vowels[0]), vowels, vowels_matches)

	for i := 0; i < len(vowels_matches); i++ {
		if len(vowels_matches[i]) > max_length {
			max_length = len(vowels_matches[i])
			position = i
		}
	}

	for _, e := range vowels_matches[position] {
		final_string_result += string(word[e])
	}

	fmt.Println(final_string_result)

}

func recursive(valid int, vowels []int, vowels_matches [][]int) [][]int {

	temp := []int{}

	for i, value := range vowels {
		if value == valid+1 || i == 0 {
			valid = value
			temp = append(temp, value)
		} else {
			next_last := int(vowels[i:][0])
			new_slice := vowels[i:]
			vowels_matches = recursive(next_last, new_slice, vowels_matches)
		}
	}

	vowels_matches = append(vowels_matches, temp)

	return vowels_matches
}
英文:

Here is my answer. And although it is not as elegant as that of the companions, it works for me. https://go.dev/play/p/ZpZtqt92lvc

package main
import (
&quot;fmt&quot;
&quot;regexp&quot;
)
func main() {
word := &quot;chrononhotonthuooaos&quot;
//word := &quot;aeoihddkdhuaiahdnaanbcvuuooaaiieendjahaaaiiie&quot;
// only vowels allowed (one or more)
re := regexp.MustCompile(`[aeiou]+`)
// to store max length slice of vowel macthes
max_length := 0
// position of the max lenghth slice inside `vowels_matches`
position := 0
// type ~~~~&gt; [][]byte{}  all coincident slices inside another
vowels_matches := re.FindAll([]byte(word), -1)
for i := 0; i &lt; len(vowels_matches); i++ {
// Looking for the longest slice with vowels
// so we are going to compare its lengths
if len(vowels_matches[i]) &gt; max_length {
//if we find an slice longer max_length will take its value
max_length = len(vowels_matches[i])
// also we want to know the position, to pick after that slice in concret
position = i
}
}
// get that slice of bytes (the longest) an convert to string
result := string(vowels_matches[position])
fmt.Printf(&quot;%v type ~~~&gt; %T\n&quot;, result, result)

}

Another option without using regex and with the help of little recursive function:https://go.dev/play/p/7sssSWv-0UL

package main
import (
&quot;fmt&quot;
)
func main() {
word := &quot;chrononhotonthuooaos&quot;
pattern := &quot;aeiou&quot;
// we only want all vowels inside `word` variable
// so `vowels` will append the positions inside `word`
vowels := []int{}
// slice of slices of vowel matches
vowels_matches := [][]int{}
// to store max length of vowel macthes slices
max_length := 0
// position of the max lenghth slice inside `vowels_matches`
position := 0
final_string_result := &quot;&quot;
// append only vowels from `word`
for i, vowel := range word {
for _, v := range pattern {
if v == vowel {
vowels = append(vowels, i) //append the position of that vowel
}
}
}
vowels_matches = recursive(int(vowels[0]), vowels, vowels_matches)
for i := 0; i &lt; len(vowels_matches); i++ {
if len(vowels_matches[i]) &gt; max_length {
max_length = len(vowels_matches[i])
position = i
}
}
for _, e := range vowels_matches[position] {
final_string_result += string(word[e])
}
fmt.Println(final_string_result)
}
func recursive(valid int, vowels []int, vowels_matches [][]int) [][]int {
temp := []int{}
for i, value := range vowels {
if value == valid+1 || i == 0 {
valid = value
temp = append(temp, value)
} else {
next_last := int(vowels[i:][0])
new_slice := vowels[i:]
vowels_matches = recursive(next_last, new_slice, vowels_matches)
}
}
vowels_matches = append(vowels_matches, temp)
return vowels_matches
}

huangapple
  • 本文由 发表于 2022年2月25日 03:10:04
  • 转载请务必保留本文链接:https://go.coder-hub.com/71257080.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定