2023-06-02 11:58:08

Tokenize each word from any start_offset

Question

I would like to tokenize the following text:

  "text": "king martin"

into

[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

But more specifically into:

[kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I have tried the following tokenizer, but how do I tell it to start at any start_offset?

  "ngram_tokenizer": {
    "type": "edge_ngram",
    "min_gram": "3",
    "max_gram": "15",
    "token_chars": [
      "letter",
      "digit"
    ]
  }

Thank you!

Answer 1

Score: 1

You can use the ngram tokenizer rather than edge_ngram. The edge_ngram tokenizer only emits n-grams anchored at the start of each word, whereas the ngram tokenizer emits them from every position in the word. Because max_gram - min_gram is greater than 1 here, index.max_ngram_diff also has to be raised to at least that difference.

PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}
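
To see what to expect from the _analyze call without a running cluster, here is a rough Python sketch that imitates the two tokenizers. It is only an approximation of their documented behaviour, not Elasticsearch code: the function names ngram_tokens and edge_ngram_tokens are made up for this illustration, and the token order is an assumption.

```python
import re

# Hypothetical helpers that approximate the tokenizers; not Elasticsearch code.
# token_chars = [letter, digit] is imitated by splitting on anything else.
WORD = re.compile(r"[^\W_]+")


def ngram_tokens(text, min_gram=3, max_gram=10):
    """Emit every substring of length min_gram..max_gram from every offset."""
    tokens = []
    for word in WORD.findall(text):
        for start in range(len(word)):
            for length in range(min_gram, max_gram + 1):
                if start + length > len(word):
                    break  # the gram would run past the end of the word
                tokens.append(word[start:start + length])
    return tokens


def edge_ngram_tokens(text, min_gram=3, max_gram=15):
    """Emit only the grams anchored at the first character of each word."""
    tokens = []
    for word in WORD.findall(text):
        for length in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:length])
    return tokens


print(ngram_tokens("king martin"))
print(edge_ngram_tokens("king martin"))
```

With "king martin", the ngram sketch yields the list from the question plus "marti", which the question's list leaves out, while the edge_ngram sketch yields only the grams that start each word (kin, king, mar, mart, marti, martin).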
