Building an ngram frequency table and dealing with multibyte runes

Question

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.

Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table and then calculate the difference between a given text and each known corpus.

This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
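For readers who want to see the comparison step, here is a minimal sketch of cosine similarity between two ngram tables. The function name `cosine` and the sample tables are made up for illustration; the question does not show how the original code computes this.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two ngram frequency tables,
// treating each table as a sparse vector indexed by ngram.
// (Hypothetical helper, not taken from the question's code.)
func cosine(a, b map[string]int) float64 {
	var dot, normA, normB float64
	for gram, x := range a {
		// A key missing from b reads as 0, so it adds nothing to the dot product.
		dot += float64(x) * float64(b[gram])
		normA += float64(x) * float64(x)
	}
	for _, y := range b {
		normB += float64(y) * float64(y)
	}
	if normA == 0 || normB == 0 {
		return 0 // an empty table has no direction, so report no similarity
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Invented sample counts, only to exercise the function.
	text := map[string]int{"th": 3, "he": 2, "er": 1}
	english := map[string]int{"th": 30, "he": 20, "er": 10}
	german := map[string]int{"er": 25, "ch": 20, "ei": 15}
	fmt.Printf("english: %.3f\n", cosine(text, english))
	fmt.Printf("german:  %.3f\n", cosine(text, german))
}
```

The corpus with the highest score is the best match. Because the result depends only on the direction of the vectors, a short text and a large corpus can still score close to 1.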

I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like it to support multibyte Unicode text. This is where I'm doing my head in.

Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0

I've only posted the table-generating logic, since everything else already works just fine.

As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Because of the way I build the ngram sequence, and because these particular runes are two bytes long, two ngrams appear with the first byte cut off.

Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am overanalysing this problem.

I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.

As ever, thanks!

Answer 1

Score: 1

If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.

Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv

Answer 2

Score: 1

By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a key that is not yet in a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.

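import "strings"

// Parse counts every n-rune window of text, with whitespace runs collapsed to one space.
func Parse(text string, n int) map[string]int {
	// Each rune is stored twice, n apart, so chars[k:k+n] is always contiguous.
	chars := make([]rune, 2*n)
	table := make(map[string]int)
	k := 0
	for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
		chars[n+k] = chars[k]
		k = (k + 1) % n
		// Note: the first n-1 windows still contain zero runes from the empty buffer.
		table[string(chars[k:k+n])]++
	}
	return table
}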

huangapple
  • Posted on 2013-12-26 11:55:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/20778906.html