Elasticsearch reindexing failing silently
Question
I am trying out semantic search in Elasticsearch, following this tutorial.
When I copy the documents of one index to another index (reindexing) with this command
POST _reindex?wait_for_completion=false
{
"source": {
"index": "collection"
},
"dest": {
"index": "collection-with-embeddings",
"pipeline": "text-embeddings"
}
}
Some of the documents are missing from the new index, and I do not know why. I am trying to find out the reason.
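As a quick sanity check (not part of the original question), the document counts of the source and destination can be compared directly, and, since the pipeline's on_failure handler reroutes failed documents, a failure index may exist as well:

GET collection/_count
GET collection-with-embeddings/_count
GET failed-collection-with-embeddings/_count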
For context,
PUT _ingest/pipeline/text-embeddings
{
"description": "Text embedding pipeline",
"processors": [
{
"inference": {
"model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
"target_field": "text_embedding",
"field_map": {
"text": "text_field"
}
}
}
],
"on_failure": [
{
"set": {
"description": "Index document to 'failed-<index>'",
"field": "_index",
"value": "failed-{{{_index}}}"
}
},
{
"set": {
"description": "Set error message",
"field": "ingest.failure",
"value": "{{_ingest.on_failure_message}}"
}
}
]
}
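One way to check whether the pipeline itself works on a sample document is the ingest simulate API. The document below is a made-up example; a real document would need a text field, as declared in the field_map:

POST _ingest/pipeline/text-embeddings/_simulate
{
  "docs": [
    {
      "_source": {
        "text": "an example sentence to embed"
      }
    }
  ]
}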
These are the task details:
{
"completed": true,
"task": {
"node": "YgR8udaSSMqClwCGWOBGBw",
"id": 5946104,
"type": "transport",
"action": "indices:data/write/reindex",
"status": {
"total": 2414,
"updated": 1346,
"created": 1068,
"deleted": 0,
"batches": 3,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0
},
"description": "reindex from [source_index] to [destination_index]",
"start_time_in_millis": 1680795982705,
"running_time_in_nanos": 22702121635,
"cancellable": true,
"cancelled": false,
"headers": {}
},
"response": {
"took": 22699,
"timed_out": false,
"total": 2414,
"updated": 1346,
"created": 1068,
"deleted": 0,
"batches": 3,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled": "0s",
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until": "0s",
"throttled_until_millis": 0,
"failures": []
}
}
My data is different, but the configuration is similar. Around 75% of the data was not copied.
I am using the sentence-transformers__msmarco-minilm-l-12-v3 model from Elasticsearch.
Any help?
Answer 1
Score: 1
You probably don't have enough processing power for the inference processor, and as a result some documents land in the failed-collection-with-embeddings index, with the reason recorded in the ingest.failure field.
What you can do is use smaller batches (specify a smaller size in the source) or use request throttling.
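A sketch of both suggestions combined (the batch size and throttle rate below are illustrative values, not recommendations): size inside source controls how many documents each reindex batch pulls, and the requests_per_second query parameter throttles the reindex task:

POST _reindex?wait_for_completion=false&requests_per_second=50
{
  "source": {
    "index": "collection",
    "size": 100
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}

The documents that already failed can be inspected in the failure index to see the recorded reason:

GET failed-collection-with-embeddings/_search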