GoLang PoS Tagger script taking longer than it should with no output in terminal
Question
This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f
But when I run it on my machine, it takes much longer than I expect, with nothing appearing in the terminal.
What I am trying to build is a part-of-speech (POS) tagger.
I think the slowest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs. But doesn't every word need to be checked to see if it is a verb?
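For illustration, here is a minimal sketch of the map-based lookup I have in mind (this assumes one verb per line in lexicon.txt; my real file format may differ):

package main

import (
    "bufio"
    "fmt"
    "os"
)

// loadLexicon reads one word per line into a set, so that membership
// checks are O(1) map lookups rather than scans over the whole lexicon.
func loadLexicon(path string) (map[string]bool, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    verbs := map[string]bool{}
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        verbs[sc.Text()] = true
    }
    return verbs, sc.Err()
}

func main() {
    verbs, err := loadLexicon("lexicon.txt")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    // Each word is checked with a single lookup, not a loop.
    fmt.Println(verbs["run"])
}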
The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic, the way I can for adverbs, adjectives, etc.
Answer 1
Score: 7
Quoting the question:
> I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
- "unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
- "rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is based on Haskell, but it will help you learn the concepts and issues in rule-based tagging.
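If you do continue down that path, a toy rule-based unigram tagger in Go might look like the sketch below; the suffix rules are purely illustrative, not a serious ruleset:

package main

import (
    "fmt"
    "strings"
)

// tagWord applies a few toy suffix rules. A real rule-based tagger
// needs a much larger, carefully ordered ruleset to be useful.
func tagWord(word string) string {
    w := strings.ToLower(word)
    switch {
    case strings.HasSuffix(w, "ly"):
        return "ADV" // "quickly" -- though "fly" already breaks this rule
    case strings.HasSuffix(w, "ing"), strings.HasSuffix(w, "ed"):
        return "VERB" // "fishing", "fished"
    case strings.HasSuffix(w, "ous"), strings.HasSuffix(w, "ful"):
        return "ADJ" // "famous", "useful"
    default:
        return "NOUN" // fall back to the most common open-class tag
    }
}

func main() {
    for _, w := range strings.Fields("the dog quickly fished") {
        fmt.Println(w, tagWord(w))
    }
}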
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches would be 'bigram', meaning that we consider the previous word when tagging word n, 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.
E.g., in
> We fish
we probably want to tag fish as a verb, whereas in
> ate fish
it's certainly a noun.
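To make that concrete, here's a minimal Go sketch of bigram-style disambiguation. The (previous tag, word) counts are hypothetical; a real tagger would estimate them from a tagged corpus:

package main

import "fmt"

// bigramTag picks the most frequent tag for word given the previous
// tag, falling back to NOUN when the context was never seen.
func bigramTag(prevTag, word string, counts map[[2]string]map[string]int) string {
    best, bestN := "NOUN", 0
    for tag, n := range counts[[2]string{prevTag, word}] {
        if n > bestN {
            best, bestN = tag, n
        }
    }
    return best
}

func main() {
    counts := map[[2]string]map[string]int{
        {"PRON", "fish"}: {"VERB": 12, "NOUN": 1}, // "we fish"
        {"VERB", "fish"}: {"NOUN": 9},             // "ate fish"
    }
    fmt.Println(bigramTag("PRON", "fish", counts)) // VERB
    fmt.Println(bigramTag("VERB", "fish", counts)) // NOUN
}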
The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again on newswire text).
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
Answer 2
Score: 0
You've got a large array argument in this function:
func stringInArray(a string, list [214]string) bool {
    for _, b := range list {
        if b == a {
            return true
        }
    }
    return false
}
The array of stopwords gets copied each time you call this function.
In Go, you should use slices rather than arrays most of the time. Change this function's parameter to list []string and define stopWords as a slice rather than an array:
stopWords := []string{
    "and", "or", ...
}
Probably an even better approach would be to build a map of the stopWords:
isStopWord := map[string]bool{}
for _, sw := range stopWords {
    isStopWord[sw] = true
}
and then you can check if a word is a stopword quickly:
if isStopWord[word] { ... }
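Putting it together, a small self-contained sketch (the stopword list here is just a stand-in for your full one):

package main

import (
    "fmt"
    "strings"
)

func main() {
    // A slice, not a fixed-size array, so nothing large gets copied.
    stopWords := []string{"and", "or", "the", "a"}

    // Build the set once; every later check is an O(1) lookup.
    isStopWord := make(map[string]bool, len(stopWords))
    for _, sw := range stopWords {
        isStopWord[sw] = true
    }

    for _, word := range strings.Fields("the cat and the dog") {
        if isStopWord[word] {
            continue // skip stopwords
        }
        fmt.Println(word)
    }
}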