In golang, what is the fastest way to remove stop words?

Question

I've created a Go package to remove stop words and I'm trying to optimize it.

Based on my research, the average list of stop words in many languages contains around 300 words.

In the current version of the package, I use a simple map to store the list of stop words. I then split the original content into words and build the filtered content by appending only the words that are not in the stop-word map.
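The map-based approach described above can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the names `stopWords` and `removeStopWords`, the five-word list, the whitespace-only splitting, and the lowercase comparison are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny illustrative set (hypothetical); a real list would
// hold roughly 300 entries. struct{} values keep the map as small as possible.
var stopWords = map[string]struct{}{
	"a": {}, "the": {}, "is": {}, "of": {}, "and": {},
}

// removeStopWords splits content on whitespace and keeps only the words
// that are not in the stop-word set, joined by single spaces.
func removeStopWords(content string) string {
	var b strings.Builder
	b.Grow(len(content)) // the result is never longer than the input
	for _, w := range strings.Fields(content) {
		// Lookup is case-insensitive in this sketch (an assumption).
		if _, found := stopWords[strings.ToLower(w)]; found {
			continue
		}
		if b.Len() > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	fmt.Println(removeStopWords("The quick fox is a friend of the dog"))
	// prints "quick fox friend dog"
}
```

Writing into a single pre-sized `strings.Builder` avoids repeated string concatenation, which is often a larger cost than the map lookup itself.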

I've tried using a Bloom filter, but it doesn't improve performance. I think this is due to two factors:

  • Bloom filters are fast for lookups in a large set, but they are expensive to build (even if built only once). So the overall gain is small when m is about 300.
  • In the current version I use a map and, if I remember correctly, Go implements maps as hash tables, so key lookups are already fast.
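To make the comparison in the two points above concrete, here is a minimal Bloom filter sketch. It is not the filter the question used: the `bloom` type, the double-hashing scheme over FNV-1a, and the parameters `m = 4096` bits and `k = 3` probes are assumptions chosen for illustration. It shows that every lookup still hashes the word, just as a map lookup does, and that a positive answer may be a false positive, so a stop-word remover would still need an exact check before dropping a word.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a hypothetical minimal Bloom filter: m bits, k probes per key.
type bloom struct {
	bits []uint64
	m, k uint64
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from one FNV-1a pass; the second is
// forced odd so the probe sequence does not collapse to a single bit.
func hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	return sum, (sum>>32 | sum<<32) | 1
}

// add sets the k bits selected for s.
func (b *bloom) add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// mayContain reports false only if s was definitely never added;
// true can be a false positive.
func (b *bloom) mayContain(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(4096, 3)
	for _, w := range []string{"a", "the", "is", "of", "and"} {
		f.add(w)
	}
	fmt.Println(f.mayContain("the")) // always true: no false negatives
	fmt.Println(f.mayContain("fox")) // usually false, but not guaranteed
}
```

With only a few hundred keys, the whole map fits comfortably in cache, so the filter saves little memory traffic while adding its own hashing work.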

Is there a faster way?

Answer 1

Score: 4

Try building a regex by joining all of your candidate words with | and compiling it ahead of time. The RE2 regex engine will convert the big list of alternations into an efficient trie data structure for matching. You can do it like this:

// words is the stop-word list; build and compile the pattern once, up front.
reStr := ""
for i, word := range words {
    if i != 0 {
        reStr += `|`
    }
    // \Q...\E makes the word match literally.
    reStr += `\Q` + word + `\E`
}
re := regexp.MustCompile(reStr)

(The \Q and \E prevent problems in the unlikely case that a word in the list contains regex metacharacters, and are harmless otherwise.)

huangapple
  • Posted on October 19, 2015, 22:34:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/33217194.html