Tokenize each word from any start_offset

Question


I would like to tokenize the following text:

  "text": "king martin"

into

  [k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

or, more specifically, into only the tokens of at least three characters:

  [kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I tried the following tokenizer, but how do I tell it to start at any start_offset rather than only at the beginning of each word?

  "ngram_tokenizer": {
    "type": "edge_ngram",
    "min_gram": "3",
    "max_gram": "15",
    "token_chars": [
      "letter",
      "digit"
    ]
  }

Thank you!

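Answer 1

Score: 1

You can use the ngram tokenizer rather than edge_ngram. The edge_ngram tokenizer only emits n-grams anchored at the start of each word, while ngram emits them from every start offset within the word. Because max_gram - min_gram is larger than the default limit of 1, index.max_ngram_diff has to be raised as well.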

PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
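
To see why switching from edge_ngram to ngram changes the result, here is a rough Python sketch that imitates the two tokenizers outside Elasticsearch. It is only an approximation for illustration: the function names `ngram_tokens` and `edge_ngram_tokens` are made up, the regular expression stands in for `token_chars: [letter, digit]`, and the order of tokens returned by a real `_analyze` call may differ.

```python
import re

# Hypothetical helper: approximates splitting on token_chars = [letter, digit]
# by keeping runs of Unicode letters and digits (underscore excluded).
def _words(text):
    return re.findall(r"[^\W_]+", text)

def ngram_tokens(text, min_gram=3, max_gram=10):
    """Imitate the `ngram` tokenizer: n-grams from every start offset."""
    tokens = []
    for word in _words(text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(word):
                    break  # n-gram would run past the end of the word
                tokens.append(word[start:start + size])
    return tokens

def edge_ngram_tokens(text, min_gram=3, max_gram=15):
    """Imitate the `edge_ngram` tokenizer: n-grams anchored at offset 0 only."""
    tokens = []
    for word in _words(text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens

print(edge_ngram_tokens("king martin"))
# ['kin', 'king', 'mar', 'mart', 'marti', 'martin']
print(ngram_tokens("king martin"))
# ['kin', 'king', 'ing', 'mar', 'mart', 'marti', 'martin',
#  'art', 'arti', 'artin', 'rti', 'rtin', 'tin']
```

The second list is the one asked for in the question, plus "marti", which the question's list leaves out but which is a valid n-gram of "martin" with min_gram 3.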


huangapple
  • Posted on June 2, 2023, 11:58:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76387027.html