elasticsearch edge n-gram tokenizer: include symbols in the tokens

Question

I am using a custom tokenizer based on the Edge NGram tokenizer, and I would like to be able to search for strings like "sport+", i.e., I would like special symbols such as the + sign to be treated as part of the token.

For example, we have documents with the following typeName values:

"typeName": "LC 500h Sport+ CVT"
or
"typeName": "LC 500h Sport CVT".

Executing a query with the following clause:

{
  "match": {
    "typeName": {
      "query": "sport+ cvt",
        "operator": "and"
    }
  }
}

fetches both documents. However, in this case we would like only the document with "typeName": "LC 500h Sport+ CVT" to be returned.

We have been using the following token_chars classes in the tokenizer settings: digit, letter, punctuation. I thought that adding symbol as a token_chars class and recreating the index would do the trick, but it has not helped.
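
A quick way to check what actually ends up in the index is the _analyze API. A minimal sketch, assuming the index is named vehicles (a hypothetical name) and uses the vehicleanalyzer defined in the EDIT below:

POST /vehicles/_analyze
{
  "analyzer": "vehicleanalyzer",
  "text": "Sport+ CVT"
}

The tokens array in the response shows exactly which terms the analyzer emits, which makes it easy to see whether the + survives tokenization.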

EDIT:
Here is the analyzer definition in Nest syntax:

Settings(s => s
    .Analysis(_ => _
        .Analyzers(a => a
            .Custom("vehicleanalyzer", descriptor => descriptor
                .Tokenizer(vehicleEdgeNgram)
                .Filters("lowercase"))
            .Standard("vehiclesearch", descriptor => descriptor))
        .Tokenizers(descriptor => descriptor
            .EdgeNGram(vehicleEdgeNgram, tokenizerDescriptor => tokenizerDescriptor
                .MinGram(1)
                .MaxGram(10)
                .TokenChars(
                    TokenChar.Digit,
                    TokenChar.Letter,
                    TokenChar.Punctuation,
                    TokenChar.Symbol)))))
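
For reference, the equivalent settings as a plain Elasticsearch request would look roughly like this. This is a sketch: the index name vehicles is hypothetical, and vehicleEdgeNgram is assumed to be a string constant holding the tokenizer name, written out here as "vehicle_edge_ngram":

PUT /vehicles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        },
        "vehiclesearch": {
          "type": "standard"
        }
      },
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["digit", "letter", "punctuation", "symbol"]
        }
      }
    }
  }
}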

Answer 1

Score: 1


As written in the documentation, token_chars is:

> Character classes that should be included in a token. Elasticsearch
> will split on characters that don’t belong to the classes specified.
> Defaults to [] (keep all characters).

By default, Elasticsearch keeps all characters. You should use this option only if you want fewer character classes in your inverted index. So, to resolve your problem, simply remove the token_chars definition: your tokenizer will then keep all characters.
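
A minimal sketch of that change, reusing the hypothetical names from the question (vehicles index, vehicle_edge_ngram tokenizer): drop token_chars entirely so it falls back to its default of [] and no character triggers a split:

PUT /vehicles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  }
}

Rerunning the _analyze request from the question should then show the + preserved in the token stream, e.g. "sport+" among the edge n-grams produced for "Sport+ CVT".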
