
Update by query is very slow in Elasticsearch

Question


I am trying to copy the data of one field with object type to another field with nested type, using a Painless script and _update_by_query. It took my Elasticsearch cluster about 48 hours to copy over 8.4M documents, and I want to speed up this operation.

This is the URL:

localhost:9200/myindex/_update_by_query?conflicts=proceed&requests_per_second=50

and here is the body:

{
    "query": {
        "match_all": {}
    },
    "script": {
        "inline": "ctx._source.myfield_copy = ctx._source.myfield;"
    }
}

Any way to speed up this operation?

Answer 1

Score: 3


By specifying requests_per_second you're throttling the operation.

Since the default batch size is 1000, and assuming a batch of 1000 records takes 100ms to write (a hypothesis, your mileage may vary), we have

target_time = 1000 / 50 per second = 20 seconds
wait_time = target_time - write_time = 20 seconds - 0.1 seconds = 19.9 seconds

So you're basically waiting 19.9 artificial seconds between each batch, which is probably why it takes so long. Just remove requests_per_second altogether and that should already help a lot.
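The arithmetic can be checked directly; a minimal sketch reproducing the answer's numbers (the batch size, throttle rate, and assumed write time all come from the text above):

```python
batch_size = 1000            # default _update_by_query batch size
requests_per_second = 50     # the throttle from the question's URL
write_time = 0.1             # assumed write time per batch, in seconds

# target_time is how long each batch *should* take at the requested rate;
# Elasticsearch pads out the difference with an artificial wait.
target_time = batch_size / requests_per_second
wait_time = target_time - write_time
print(target_time, wait_time)  # → 20.0 19.9
```

At 19.9 seconds of idle time per 1000-document batch, 8.4M documents spend dozens of hours doing nothing but waiting, which matches the observed runtime.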

If you're just copying one field into another, another performance improvement would be to do it not with a script but with an ingest pipeline:

PUT _ingest/pipeline/copy-field
{
  "processors": [
    {
      "set": {
        "field": "myfield_copy",
        "copy_from": "myfield"
      }
    }
  ]
}

And then simply run update by query by referencing that pipeline:

POST myindex/_update_by_query?pipeline=copy-field&wait_for_completion=false
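Combining both suggestions, the final request can carry a few more documented `_update_by_query` query parameters: `requests_per_second=-1` disables throttling explicitly, and `slices=auto` parallelizes the operation across shards. A small sketch (plain Python that only composes the URL rather than calling the cluster; `localhost:9200` is assumed from the question):

```python
from urllib.parse import urlencode

# Documented _update_by_query query parameters: -1 means "no throttling",
# and slices=auto lets Elasticsearch pick one slice per shard.
params = {
    "pipeline": "copy-field",
    "conflicts": "proceed",
    "wait_for_completion": "false",
    "requests_per_second": "-1",
    "slices": "auto",
}

url = "http://localhost:9200/myindex/_update_by_query?" + urlencode(params)
print(url)
```

The result can be POSTed as-is with curl; because `wait_for_completion=false`, Elasticsearch returns a task ID that can be polled through the `_tasks` API instead of holding the connection open.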

Posted by huangapple on 2023-03-09 22:11:23.
Source: https://go.coder-hub.com/75685751.html