Update by query is very slow in Elasticsearch

Question
I am trying to copy the data of one field with object type to another field with nested type, using a Painless script and _update_by_query. It took my Elasticsearch cluster ca. 48 hours to copy over 8.4M documents, and I want to speed up this operation.
This is the URL:
localhost:9200/myindex/_update_by_query?conflicts=proceed&requests_per_second=50
and here is the body:
{
  "query": {
    "match_all": {}
  },
  "script": {
    "inline": "ctx._source.myfield_copy = ctx._source.myfield;"
  }
}
Any way to speed up this operation?
Answer 1

Score: 3
By specifying requests_per_second you're throttling the operation.

Since the default batch size is 1000, and assuming the write time for 50 records is 100ms (a hypothesis; your mileage may vary), we have

target_time = 1000 / 50 per second = 20 seconds
wait_time = target_time - write_time = 20 seconds - 0.1 seconds = 19.9 seconds

So you're basically waiting 19.9 artificial seconds between each batch, which is probably why it takes so long. Just remove requests_per_second altogether and that should already help a lot.
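For instance, a minimal sketch of the unthrottled call, reusing the query and script from the question (wait_for_completion=false simply runs it as a background task instead of blocking the request):

POST myindex/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": {
    "match_all": {}
  },
  "script": {
    "inline": "ctx._source.myfield_copy = ctx._source.myfield;"
  }
}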
If you're just copying one field into another, another performance improvement would be to not do it with a script but with an ingest pipeline:
PUT _ingest/pipeline/copy-field
{
  "processors": [
    {
      "set": {
        "field": "myfield_copy",
        "copy_from": "myfield"
      }
    }
  ]
}
And then simply run the update by query, referencing that pipeline:
POST myindex/_update_by_query?pipeline=copy-field&wait_for_completion=false
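If you want to sanity-check the pipeline before running it over the whole index, you can feed it a test document through the simulate API; the sample _source below is hypothetical, so substitute a realistic value of myfield:

POST _ingest/pipeline/copy-field/_simulate
{
  "docs": [
    {
      "_source": {
        "myfield": {
          "some_key": "some_value"
        }
      }
    }
  ]
}

And since wait_for_completion=false returns a task id, you can monitor the operation's progress with GET _tasks/<task_id>.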