Strange tokenization in Lucene 8 Brazilian Portuguese analyzers

Question


I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.

Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.

> Pra onde você for

String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
  try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while(tokenStream.incrementToken()) {
      System.out.println("term: charTermAttribute.toString());
    }
    tokenStream.end();
  }
}

This gives me the following terms (collapsed to one line for readability):

> pra onde você for

OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:

> pra onde

Huh? Maybe it thinks that "você" ("you") and "for" ("may go") are stop words and removed them.

But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:

> pra ond voc for

Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".
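In case it helps reproduce the comparison, here is the same loop parameterized over the analyzer (a minimal sketch; the printTerms helper name is just mine for illustration, and the expected terms are shown in the comments):

// Same tokenization loop as above, but taking the analyzer as a parameter.
static void printTerms(Analyzer analyzer, String text) throws IOException {
  try (TokenStream tokenStream = analyzer.tokenStream("text", text)) {
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      System.out.println("term: " + charTermAttribute.toString());
    }
    tokenStream.end();
  }
}

// All three analyzers created with their no-args constructors.
try (Analyzer standard = new StandardAnalyzer();
     Analyzer pt = new PortugueseAnalyzer();
     Analyzer ptBR = new BrazilianAnalyzer()) {
  printTerms(standard, "Pra onde você for"); // pra onde você for
  printTerms(pt, "Pra onde você for");       // pra onde
  printTerms(ptBR, "Pra onde você for");     // pra ond voc for
}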

Other lines are as bad or worse:

  • Text: "A saudade é dor, volta meu amor"
  • StandardAnalyzer: a saudade é dor volta meu amor
  • PortugueseAnalyzer: saudad é dor volt amor
  • BrazilianAnalyzer: saudad é dor volt amor

Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output, as "volta" sure needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.

Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.

Answer 1

Score: 0


Looking at the code for the PortugueseAnalyzer and the BrazilianAnalyzer, it looks like these analyzers are performing stemming. (I'm a little new to coding Lucene, so it's not something I expected.) So for indexing, maybe this is what the authors intended. Perhaps "voc" is a stem for "você" and "vocês", and I guess "volt" is the stem of the verb (infinitive form) "voltar". (Though "saudad" is not what I would expect for the stem of "saudade"; but again, this aspect of text analysis is a bit new to me.)

For my particular use case, I just want to tokenize the words and skip stop words. I can't find a way to turn off stemming for the PortugueseAnalyzer and the BrazilianAnalyzer, so I guess I'll just use a StandardAnalyzer but use the stop words from the language-specific analyzer, like this:

final Analyzer analyzer;
try (BrazilianAnalyzer ptBRAnalyzer = new BrazilianAnalyzer()) {
  analyzer = new StandardAnalyzer(ptBRAnalyzer.getStopwordSet());
}
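
Running that combined analyzer through the same tokenization loop from the question then looks like this (a minimal sketch; the expected terms in the comment are taken from the comparison below):

// Tokenize one of the sample lines with the StandardAnalyzer that was
// built with the BrazilianAnalyzer's stop word set.
try (TokenStream tokenStream = analyzer.tokenStream("text", "A saudade é dor, volta meu amor")) {
  CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
  tokenStream.reset();
  while (tokenStream.incrementToken()) {
    System.out.println("term: " + charTermAttribute.toString());
  }
  tokenStream.end();
}
// prints, one term per line: saudade é dor volta meu amor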

That's a little roundabout, but at least it gives me more of what I was looking for:

  • Text: "A saudade é dor, volta meu amor"
  • StandardAnalyzer: a saudade é dor volta meu amor
  • StandardAnalyzer with PortugueseAnalyzer stop words: saudade é dor volta amor
  • StandardAnalyzer with BrazilianAnalyzer stop words: saudade é dor volta meu amor

That's better. But apparently the Portuguese analyzer thinks "meu" is a stop word, even though the Brazilian analyzer does not. I would guess that the word for "my" means pretty much the same thing in European Portuguese and Brazilian Portuguese; it seems odd that the two analyzers would disagree on whether it should be a stop word by default.
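
To double-check that, one can inspect the default stop sets directly (a minimal sketch; both analyzers expose a static getDefaultStopSet(), and the expected booleans are inferred from the outputs above):

// Compare the default stop word sets of the two language-specific analyzers.
CharArraySet ptStopWords = PortugueseAnalyzer.getDefaultStopSet();
CharArraySet ptBRStopWords = BrazilianAnalyzer.getDefaultStopSet();
System.out.println("pt has \"meu\": " + ptStopWords.contains("meu"));      // expected: true
System.out.println("pt-BR has \"meu\": " + ptBRStopWords.contains("meu")); // expected: false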
