Strange tokenization in Lucene 8 Brazilian Portuguese analyzers
Question
I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.
Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.
> Pra onde você for
String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
    try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println("term: " + charTermAttribute.toString());
        }
        tokenStream.end();
    }
}
This gives me the following terms (collapsed to one line for readability):
> pra onde você for
OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:
> pra onde
Huh? Maybe it thinks that "você" ("you") and "for" ("may go") are stop words and removed them.
But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:
> pra ond voc for
Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".
Other lines are as bad or worse:
- Text: "A saudade é dor, volta meu amor"
- StandardAnalyzer: a saudade é dor volta meu amor
- PortugueseAnalyzer: saudad é dor volt amor
- BrazilianAnalyzer: saudad é dor volt amor
Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output, as "volta" surely needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.
Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.
Answer 1
Score: 0
Looking at the code for the PortugueseAnalyzer and the BrazilianAnalyzer, it looks like these analyzers are performing stemming. (I'm a little new to coding Lucene, so it's not something I expected.) So for indexing, maybe this is what the authors intended. Perhaps "voc" is a stem for "você" and "vocês". And I guess "volt" is the stem of the verb (infinitive form) "voltar". (But "saudad" is not what I would expect for the stem of "saudade"; then again, this aspect of text analysis is a bit new to me.)
For my particular use case, I just want to tokenize the words and skip stop words. I can't find a way to turn off stemming for the PortugueseAnalyzer and the BrazilianAnalyzer, so I guess I'll just use a StandardAnalyzer but use the stop words from the language-specific analyzer, like this:
final Analyzer analyzer;
try (BrazilianAnalyzer ptBRAnalyzer = new BrazilianAnalyzer()) {
    analyzer = new StandardAnalyzer(ptBRAnalyzer.getStopwordSet());
}
That's a little roundabout, but at least it gives me more of what I was looking for:
- Text: "A saudade é dor, volta meu amor"
- StandardAnalyzer: a saudade é dor volta meu amor
- StandardAnalyzer with PortugueseAnalyzer stop words: saudade é dor volta amor
- StandardAnalyzer with BrazilianAnalyzer stop words: saudade é dor volta meu amor
That's better. But apparently the Portuguese analyzer thinks "meu" is a stop word, even though the Brazilian analyzer does not. I would guess that the word for "my" pretty much means the same in Portugal Portuguese and Brazilian Portuguese; it seems odd the two analyzers would disagree on whether it should be a stop word by default.
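For completeness: another way to get "stop words but no stemming" would be a small custom Analyzer that wires up only a tokenizer, lowercasing, and a stop filter. This is an unverified sketch against the Lucene 8 analysis API; the class name and the choice of the Portuguese default stop set are my own, and you could swap in BrazilianAnalyzer.getDefaultStopSet() instead:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: tokenize, lowercase, and drop stop words; no stemming stage at all.
public class NoStemPortugueseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        // Default Portuguese stop set; use the Brazilian set for pt-BR text.
        stream = new StopFilter(stream, PortugueseAnalyzer.getDefaultStopSet());
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

Compared with reusing StandardAnalyzer plus a borrowed stop set, this makes the analysis chain explicit, which may be easier to extend later (e.g. with an accent-folding filter).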