Tokenize each word from any start_offset

Question

I would like to tokenize the following text:

  "text": "king martin"

into

[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

More specifically, into:

 [kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I tried the following tokenizer, but how do I tell it to start at any start_offset?

  "ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": "3",
        "max_gram": "15",
        "token_chars": [
          "letter",
          "digit"
        ]
      }

Thank you!

Answer 1

Score: 1

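You can use the ngram tokenizer rather than edge_ngram. The edge_ngram tokenizer only emits grams anchored at the start of each word, whereas ngram emits grams starting at every character position. Since max_gram minus min_gram is larger than the default limit of 1 here, index.max_ngram_diff also has to be raised, as in the settings below.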

PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
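
To see which tokens to expect without a running cluster, here is a rough pure-Python sketch of what the two tokenizers do with min_gram 3 and max_gram 10. It is not Elasticsearch code: the function name word_ngrams and its edge_only flag are made up for illustration, and the letter/digit splitting only approximates token_chars, so treat the output as an expectation to check against _analyze rather than a guarantee.

```python
import re


def word_ngrams(text, min_gram=3, max_gram=10, edge_only=False):
    """Approximate the ngram / edge_ngram tokenizers (hypothetical helper).

    Words are runs of letters and digits, roughly mirroring
    token_chars = ["letter", "digit"].
    """
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        # edge_ngram only starts at offset 0; ngram starts at every offset
        starts = [0] if edge_only else range(len(word))
        for start in starts:
            for length in range(min_gram, max_gram + 1):
                if start + length > len(word):
                    break  # gram would run past the end of the word
                tokens.append(word[start:start + length])
    return tokens


# Grams anchored at the start of each word only
print(word_ngrams("king martin", edge_only=True))
# Grams starting at any offset inside each word
print(word_ngrams("king martin"))
```

Under these assumptions the second call yields kin, king, ing, mar, mart, marti, martin, art, arti, artin, rti, rtin, tin. Note that marti is included, although the list in the question leaves it out.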

Posted by huangapple on June 2, 2023, 11:58:08
Source: https://go.coder-hub.com/76387027.html