How do I count Japanese words in Go?

Question

Walking through the Go Tour gives the nice impression that Unicode is supported out of the box.

Counting words in text that doesn't use standard separators like spaces, especially in Japanese and Chinese, has been painful in other programming languages (PHP), so I'm curious to know if it is possible to count the words written in Japanese language (eg: katakana) using Go-programming language.

If yes, how?

Answer 1

Score: 1

The answer is Yes. It is "possible to count the words written in Japanese language (eg: katakana) using Go-programming language." But first you need to improve your question.

Someone reading your phrase, "standard separators like spaces", might believe that word counting is a well-defined operation. It is not, even for languages like English. In the phrase, "testing 1 2 3 testing", does the string "1 2 3" represent one word, or three, or zero? Is the answer different for "testing 123 testing"? How many words are in the phrase, "testing <mytag class="numbers">1 2 3</mytag> testing"?
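To make the ambiguity concrete, here is a minimal sketch that counts whitespace-separated fields with `strings.Fields` from the standard library. The helper name `countBySpaces` is made up for illustration, and "one field equals one word" is only an assumption, not a recommended definition.

```go
package main

import (
	"fmt"
	"strings"
)

// countBySpaces counts whitespace-separated fields.
// Assumption for illustration only: every field is treated as one "word".
func countBySpaces(s string) int {
	return len(strings.Fields(s))
}

func main() {
	samples := []string{
		"testing 1 2 3 testing",
		"testing 123 testing",
		`testing <mytag class="numbers">1 2 3</mytag> testing`,
	}
	for _, s := range samples {
		fmt.Printf("%d fields: %s\n", countBySpaces(s), s)
	}
}
```

Under this definition the three phrases yield 5, 3, and 6 fields. In the last phrase the markup is glued to the digits, so `class="numbers">1` counts as a single "word", which is probably not what anyone intends.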

Someone might also believe the Japanese language has a concept of "words" analogous to the English one, just with a different syntactical convention. That is not correct for many languages, such as Japanese, written Chinese, and Thai.

So, you must first improve your question by defining what "words" are, in Latin-script text, for languages like English.

Do you want a simple lexical definition, based on the presence of spacing characters? Then consider using Unicode TR 29 Version 4.1.0 - Text Boundaries, Section 4 Word Boundaries. This defines "word boundaries" in terms of regular expressions and Unicode character properties. The localisation industry standard GMX-V, Word Boundaries section, uses TR 29.

Once you have your definition, I'm confident you'd be able to implement it using Go packages like unicode and text/scanner. I haven't done this myself. From a quick look at the official packages list, it looks like the existing packages don't have a TR 29 implementation. But your question asks if it is "possible", not "already implemented by an official package".
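As a rough sketch of what implementing a chosen definition with the `unicode` package might look like, the example below treats a "word" as a maximal run of letters and digits. This is an assumed, much simpler rule than TR 29 and is not an implementation of it; the name `countLetterRuns` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// countLetterRuns counts maximal runs of Unicode letters and digits.
// Assumption: this stands in for a real word definition; it is NOT TR 29.
func countLetterRuns(s string) int {
	// Any rune that is neither a letter nor a digit separates words.
	isSeparator := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	}
	return len(strings.FieldsFunc(s, isSeparator))
}

func main() {
	samples := []string{
		"testing 1 2 3 testing",
		"don't stop",
		"私はカタカナを勉強します。",
	}
	for _, s := range samples {
		fmt.Printf("%d runs: %s\n", countLetterRuns(s), s)
	}
}
```

The Japanese sentence comes out as a single run, because kanji, hiragana, and katakana are all letters and nothing separates them. That shows why a rule tuned for spaced scripts says little about Japanese. The apostrophe in "don't" also splits the word in two, which a fuller rule set would have to handle.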

Next, for Japanese: do you want a simple lexical definition of "word"? If so, Unicode TR 29 supplies a default, with a caveat. It says:

> For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

If you want a linguistically sophisticated definition of "word" in the Japanese context, then you need to start considering the issues raised by @Jhilke Dai, Sergio Tulentsev, and the other contributors. You will need to design your specification of "word". Then you will need to implement it. I'm confident you will not find such an implementation in an official Go package as of July 2014. However, I'm also confident that if you can design a clear specification, it is "possible" to implement it in Go.
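To illustrate that even a deliberately naive specification can be implemented once it is written down, the sketch below counts maximal runs of katakana, using the `unicode.Katakana` script table from the standard library. The specification itself ("a katakana run is one word") is an assumption made up for this example and is not linguistically sound; the function name `countKatakanaRuns` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// countKatakanaRuns counts maximal runs of katakana characters.
// Assumed toy specification: one katakana run equals one "word".
func countKatakanaRuns(s string) int {
	count := 0
	inRun := false
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Katakana, r):
			if !inRun {
				count++
				inRun = true
			}
		case r == 'ー' && inRun:
			// The prolonged sound mark (U+30FC) is not in the Katakana
			// script table, so it is allowed to extend a run explicitly.
		default:
			inRun = false
		}
	}
	return count
}

func main() {
	text := "コンピューターでテキストを処理する"
	fmt.Printf("%d katakana runs in %s\n", countKatakanaRuns(text), text)
}
```

For the sample sentence this reports 2 runs (コンピューター and テキスト). It ignores the kanji and hiragana words entirely and would merge adjacent katakana words, so a real specification would need dictionary-based or statistical segmentation.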

Now: how many words are there in this reply? How did you count them?

Posted by huangapple on July 4, 2014, 22:39:51
Source: https://go.coder-hub.com/24576659.html