Strange tokenization in Lucene 8 Brazilian Portuguese analyzers
Question
I'm using Lucene 8.6.2 (currently the latest available) with AdoptOpenJDK 11 on Windows 10, and I'm having odd problems with the Portuguese and Brazilian Portuguese analyzers mangling the tokenization.
Let's take a simple example: the first line of the chorus from Jorge Aragão's famous samba song, "Já É", first using an org.apache.lucene.analysis.standard.StandardAnalyzer for reference.
> Pra onde você for
String text = "Pra onde você for";
try (Analyzer analyzer = new StandardAnalyzer()) {
    try (final TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println("term: " + charTermAttribute.toString());
        }
        tokenStream.end();
    }
}
This gives me the following terms (collapsed to one line for readability):
> pra onde você for
OK, that's pretty much what I would expect with any analyzer. But here is what I get if I use the org.apache.lucene.analysis.pt.PortugueseAnalyzer instead, using the no-args constructor:
> pra onde
Huh? Maybe it thinks that "você" ("you") and "for" ("may go") are stop words and removed them.
But now let's try the org.apache.lucene.analysis.br.BrazilianAnalyzer, again using the no-args constructor:
> pra ond voc for
Now that is just broken and mangled. It changed "onde" ("where") to "ond", which to my knowledge is not even a Portuguese word. And for "você" it just dropped the "ê".
Other lines are as bad or worse:
- Text: "A saudade é dor, volta meu amor"
- StandardAnalyzer: a saudade é dor volta meu amor
- PortugueseAnalyzer: saudad é dor volt amor
- BrazilianAnalyzer: saudad é dor volt amor
Here you can see that the Portuguese and Brazilian Portuguese analyzers produced the same output, but it is the same broken output, as "volta" surely needs to stay "volta" (and not "volt") if I'm ever going to get my love to come back to me.
Am I making some serious mistake with the Lucene core libraries and language analyzers? The output makes no sense, and I'm surprised that analyzers for such a common language would mangle the tokens like that.
Answer 1
Score: 0
Looking at the code for the PortugueseAnalyzer and the BrazilianAnalyzer, it looks like these analyzers are performing stemming. (I'm a little new to coding Lucene, so it's not something I expected.) So for indexing, maybe this is what the authors intended. Perhaps "voc" is a stem for "você" and "vocês". And I guess "volt" is the stem of the verb (infinitive form) "voltar". (But "saudad" is not what I would expect for the stem of "saudade"; then again, this aspect of text analysis is a bit new to me.)
For my particular use case, I just want to tokenize the words and skip stop words. I can't find a way to turn off stemming for the PortugueseAnalyzer and the BrazilianAnalyzer, so I guess I'll just use a StandardAnalyzer but use the stop words from the language-specific analyzer, like this:
final Analyzer analyzer;
try (BrazilianAnalyzer ptBRAnalyzer = new BrazilianAnalyzer()) {
    analyzer = new StandardAnalyzer(ptBRAnalyzer.getStopwordSet());
}
That's a little roundabout, but at least it gives me more of what I was looking for:
- Text: "A saudade é dor, volta meu amor"
- StandardAnalyzer: a saudade é dor volta meu amor
- StandardAnalyzer with PortugueseAnalyzer stop words: saudade é dor volta amor
- StandardAnalyzer with BrazilianAnalyzer stop words: saudade é dor volta meu amor
That's better. But apparently the Portuguese analyzer thinks "meu" is a stop word, even though the Brazilian analyzer does not. I would guess that the word for "my" pretty much means the same in Portugal Portuguese and Brazilian Portuguese; it seems odd the two analyzers would disagree on whether it should be a stop word by default.
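For completeness: another way to get "stop words but no stemming" would be a small custom Analyzer that wires up only a tokenizer, lowercasing, and a stop filter. This is an unverified sketch against the Lucene 8 analysis API; the class name and the choice of the Portuguese default stop set are my own, and you could swap in BrazilianAnalyzer.getDefaultStopSet() instead:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: tokenize, lowercase, and drop stop words; no stemming stage at all.
public class NoStemPortugueseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        // Default Portuguese stop set; use the Brazilian set for pt-BR text.
        stream = new StopFilter(stream, PortugueseAnalyzer.getDefaultStopSet());
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

Compared with reusing StandardAnalyzer plus a borrowed stop set, this makes the analysis chain explicit, which may be easier to extend later (e.g. with an accent-folding filter).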