elasticsearch edge n-gram tokenizer: include symbols in the tokens

Question

I am using a custom tokenizer based on the Edge NGram tokenizer, and I would like to be able to search for strings like "sport+", i.e., I would like special symbols such as the + sign to be treated as part of the token.

For example, we have documents with the following typeName values:

"typeName": "LC 500h Sport+ CVT"
or
"typeName": "LC 500h Sport CVT".

Executing a query with the following clause:

{
  "match": {
    "typeName": {
      "query": "sport+ cvt",
        "operator": "and"
    }
  }
}

fetches both documents. However, in this case we would like only the document with "typeName": "LC 500h Sport+ CVT" to be returned.

We have been using the following token_chars classes in the tokenizer settings: digit, letter, punctuation. I thought that adding symbol as a token_chars class and recreating the index would do the trick, but it has not helped.
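
A quick way to check what actually ends up in the index is the _analyze API. A minimal sketch, assuming the index is named vehicles (a hypothetical name) and uses the vehicleanalyzer defined in the EDIT below:

POST /vehicles/_analyze
{
  "analyzer": "vehicleanalyzer",
  "text": "Sport+ CVT"
}

The tokens array in the response shows exactly which terms the analyzer emits, which makes it easy to see whether the + survives tokenization.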

EDIT:
Here is the analyzer definition in Nest syntax:

Settings(s => s
    .Analysis(_ => _
        .Analyzers(a => a
            .Custom("vehicleanalyzer", descriptor => descriptor
                .Tokenizer(vehicleEdgeNgram)
                .Filters("lowercase"))
            .Standard("vehiclesearch", descriptor => descriptor))
        .Tokenizers(descriptor => descriptor
            .EdgeNGram(vehicleEdgeNgram, tokenizerDescriptor => tokenizerDescriptor
                .MinGram(1)
                .MaxGram(10)
                .TokenChars(
                    TokenChar.Digit,
                    TokenChar.Letter,
                    TokenChar.Punctuation,
                    TokenChar.Symbol)))))
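
For reference, the equivalent settings as a plain Elasticsearch request would look roughly like this. This is a sketch: the index name vehicles is hypothetical, and vehicleEdgeNgram is assumed to be a string constant holding the tokenizer name, written out here as "vehicle_edge_ngram":

PUT /vehicles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        },
        "vehiclesearch": {
          "type": "standard"
        }
      },
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["digit", "letter", "punctuation", "symbol"]
        }
      }
    }
  }
}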

Answer 1

Score: 1


As written in the documentation, token_chars is:

> Character classes that should be included in a token. Elasticsearch
> will split on characters that don’t belong to the classes specified.
> Defaults to [] (keep all characters).

By default, Elasticsearch keeps all characters. You should use this option only if you want fewer character classes in your inverted index. So, to resolve your problem, simply remove the token_chars definition: your tokenizer will then keep all characters.
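
A minimal sketch of that change, reusing the hypothetical names from the question (vehicles index, vehicle_edge_ngram tokenizer): drop token_chars entirely so it falls back to its default of [] and no character triggers a split:

PUT /vehicles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  }
}

Rerunning the _analyze request from the question should then show the + preserved in the token stream, e.g. "sport+" among the edge n-grams produced for "Sport+ CVT".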
