Update by query is very slow in Elasticsearch

Question
I am trying to copy the data of one field with object type to another field with nested type, using a Painless script and _update_by_query. It took my Elasticsearch cluster ca. 48 hours to copy over 8.4M documents, and I want to speed up this operation.
This is the URL:
localhost:9200/myindex/_update_by_query?conflicts=proceed&requests_per_second=50
and here is the body:
{
  "query": {
    "match_all": {}
  },
  "script": {
    "inline": "ctx._source.myfield_copy = ctx._source.myfield;"
  }
}
Any way to speed up this operation?
Answer 1

Score: 3
By specifying requests_per_second you're throttling the operation.

Since the default batch size is 1000, and assuming the write time for 50 records is 100ms (a hypothesis; your mileage may vary), we have

target_time = 1000 / 50 per second = 20 seconds
wait_time = target_time - write_time = 20 seconds - 0.1 seconds = 19.9 seconds

So you're basically waiting 19.9 artificial seconds between each batch, which is probably why it takes so long. Just remove requests_per_second altogether and that should already help a lot.
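For instance, a minimal sketch of the unthrottled call, reusing the query and script from the question (wait_for_completion=false simply runs it as a background task instead of blocking the request):

POST myindex/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": {
    "match_all": {}
  },
  "script": {
    "inline": "ctx._source.myfield_copy = ctx._source.myfield;"
  }
}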
If you're just copying one field into another, another performance improvement would be to not do it with a script but with an ingest pipeline:
PUT _ingest/pipeline/copy-field
{
  "processors": [
    {
      "set": {
        "field": "myfield_copy",
        "copy_from": "myfield"
      }
    }
  ]
}
And then simply run the update by query, referencing that pipeline:
POST myindex/_update_by_query?pipeline=copy-field&wait_for_completion=false
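If you want to sanity-check the pipeline before running it over the whole index, you can feed it a test document through the simulate API; the sample _source below is hypothetical, so substitute a realistic value of myfield:

POST _ingest/pipeline/copy-field/_simulate
{
  "docs": [
    {
      "_source": {
        "myfield": {
          "some_key": "some_value"
        }
      }
    }
  ]
}

And since wait_for_completion=false returns a task id, you can monitor the operation's progress with GET _tasks/<task_id>.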