Elasticsearch reindexing failing silently
Question
I am trying out semantic search in Elasticsearch, following this tutorial.
When I copy the documents of one index to another index (reindexing) with this command
POST _reindex?wait_for_completion=false
{
"source": {
"index": "collection"
},
"dest": {
"index": "collection-with-embeddings",
"pipeline": "text-embeddings"
}
}
Some of the documents are missing from the new index, and I do not know why. I am trying to find out the reason.
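As a quick sanity check (not part of the original question), the document counts of the source and destination can be compared directly, and, since the pipeline's on_failure handler reroutes failed documents, a failure index may exist as well:

GET collection/_count
GET collection-with-embeddings/_count
GET failed-collection-with-embeddings/_count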
For context,
PUT _ingest/pipeline/text-embeddings
{
"description": "Text embedding pipeline",
"processors": [
{
"inference": {
"model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
"target_field": "text_embedding",
"field_map": {
"text": "text_field"
}
}
}
],
"on_failure": [
{
"set": {
"description": "Index document to 'failed-<index>'",
"field": "_index",
"value": "failed-{{{_index}}}"
}
},
{
"set": {
"description": "Set error message",
"field": "ingest.failure",
"value": "{{_ingest.on_failure_message}}"
}
}
]
}
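One way to check whether the pipeline itself works on a sample document is the ingest simulate API. The document below is a made-up example; a real document would need a text field, as declared in the field_map:

POST _ingest/pipeline/text-embeddings/_simulate
{
  "docs": [
    {
      "_source": {
        "text": "an example sentence to embed"
      }
    }
  ]
}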
These are the task details:
{
"completed": true,
"task": {
"node": "YgR8udaSSMqClwCGWOBGBw",
"id": 5946104,
"type": "transport",
"action": "indices:data/write/reindex",
"status": {
"total": 2414,
"updated": 1346,
"created": 1068,
"deleted": 0,
"batches": 3,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0
},
"description": "reindex from [source_index] to [destination_index]",
"start_time_in_millis": 1680795982705,
"running_time_in_nanos": 22702121635,
"cancellable": true,
"cancelled": false,
"headers": {}
},
"response": {
"took": 22699,
"timed_out": false,
"total": 2414,
"updated": 1346,
"created": 1068,
"deleted": 0,
"batches": 3,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled": "0s",
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until": "0s",
"throttled_until_millis": 0,
"failures": []
}
}
My data is different, but the configuration is similar. Around 75% of the data was not copied.
I am using the sentence-transformers__msmarco-minilm-l-12-v3 model from Elasticsearch.
Any help?
Answer 1
Score: 1
You probably don't have enough processing power for the inference processor, and as a result some documents land in the failed-collection-with-embeddings index, with the reason recorded in the ingest.failure field.
What you can do is use smaller batches (specify a smaller size in the source) or use request throttling.
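A sketch of both suggestions combined (the batch size and throttle rate below are illustrative values, not recommendations): size inside source controls how many documents each reindex batch pulls, and the requests_per_second query parameter throttles the reindex task:

POST _reindex?wait_for_completion=false&requests_per_second=50
{
  "source": {
    "index": "collection",
    "size": 100
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}

The documents that already failed can be inspected in the failure index to see the recorded reason:

GET failed-collection-with-embeddings/_search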