Tokenize each word from any start_offset

Question
I would like to tokenize the following text:
"text": "king martin"
into
[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]
But more specifically into:
[kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]
Is there a way to get these tokens? I have tried the following tokenizer, but how do I tell it to "start at any start_offset"?
"ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
Thank you!
Answer 1

Score: 1
You can use the ngram tokenizer rather than edge_ngram.
PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
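With these settings the _analyze call should return something close to the requested list (every substring of 3 to 10 letters and digits, emitted per start offset):

[kin, king, ing, mar, mart, marti, martin, art, arti, artin, rti, rtin, tin]

Note that index.max_ngram_diff is raised to 10 because recent Elasticsearch versions only allow a difference of 1 between min_gram and max_gram by default.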