在golang中创建一个后缀树。

huangapple go评论68阅读模式
英文:

Create a suffix tree in golang

问题

我有一个字符串数组,我需要在Golang中创建一个后缀树。
Golang中的SuffixArray无法满足我的需求,因为它只接受字节数组(即单个字符串的数组)。
有人可以提供实现的指导吗?
提前感谢。

英文:

I have an array of strings and I need to create a suffix tree out of it in Golang.
SuffixArray in Golang does not suffice my needs, because it only accepts byte array (i.e of a single string).
Could anybody provide pointers for implementation.
Thanks in advance.

答案1

得分: 8

这是一个使用后缀数组进行自动补全的示例。注意,我使用了一个以\x00为前缀将所有字符串连接在一起的方法,这个前缀在字符串中不会出现。

package main

import (
	"fmt"
	"index/suffixarray"
	"regexp"
	"strings"
)

func main() {
	words := []string{
		"aardvark",
		"happy",
		"hello",
		"hero",
		"he",
		"hotel",
	}
	// 使用\x00作为每个字符串的起始符号
	joinedStrings := "\x00" + strings.Join(words, "\x00")
	sa := suffixarray.New([]byte(joinedStrings))

	// 用户输入了"he"
	match, err := regexp.Compile("\x00he[^\x00]*")
	if err != nil {
		panic(err)
	}
	ms := sa.FindAllIndex(match, -1)

	for _, m := range ms {
		start, end := m[0], m[1]
		fmt.Printf("match = %q\n", joinedStrings[start+1:end])
	}
}

输出结果为:

match = "hello"
match = "hero"
match = "he"
英文:

Here is an example of how to use suffix array to do auto completion. (playground).

Note that I joined all the strings together with a prefix of \x00 which can't occur in the strings first.

package main

import (
	"fmt"
	"index/suffixarray"
	"regexp"
	"strings"
)

func main() {
	words := []string{
		"aardvark",
		"happy",
		"hello",
		"hero",
		"he",
		"hotel",
	}
	// use \x00 to start each string
	joinedStrings := "\x00" + strings.Join(words, "\x00")
	sa := suffixarray.New([]byte(joinedStrings))

	// User has typed in "he"
	match, err := regexp.Compile("\x00he[^\x00]*")
	if err != nil {
		panic(err)
	}
	ms := sa.FindAllIndex(match, -1)

	for _, m := range ms {
		start, end := m[0], m[1]
		fmt.Printf("match = %q\n", joinedStrings[start+1:end])
	}
}

Prints

match = "hello"
match = "hero"
match = "he"

答案2

得分: 1

你想要的是通用后缀树(Generalized Suffix Tree)。构建这样的树的一种简单方法是为每个字符串附加一个不同的结束标记(在任何字符串中都没有使用的符号),将它们连接起来,并为连接后的字符串构建一个普通的后缀树。所以你只需要将 "hello world" 添加到字符串集合中,并使用以下代码来获取包含 "wor" 的字符串:

match, err := regexp.Compile("[^\x00]*wor[^\x00]*")

注意,正确的字符串是 joinedStrings[start:end]

英文:

What you want is called generalized suffix tree. A simple way to build such trees is to append a different end marker(symbols not used in any of the strings) to each strings, concatenate them and build a normal suffix tree for the concatenated string. So you just need to add "hello world" to the string set and use:

match, err := regexp.Compile("[^\x00]*wor[^\x00]*")

to get the strings contain "wor". Note that the correct string is joinedStrings[start:end].

答案3

得分: 0

我创建了一个具有O(n)复杂度的后缀树实现,其中n是字符串的长度:https://github.com/twelvedata/searchindex

更多详细信息请参阅我在Medium上的文章:https://medium.com/twelve-data/in-memory-text-search-index-for-quotes-on-go-5243adc62c26

英文:

I created implementation of suffix tree with O(n) complexity, where n is length of string: https://github.com/twelvedata/searchindex

More details in my article on Medium https://medium.com/twelve-data/in-memory-text-search-index-for-quotes-on-go-5243adc62c26

huangapple
  • 本文由 发表于 2014年3月22日 06:46:02
  • 转载请务必保留本文链接:https://go.coder-hub.com/22570839.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定