Posted June 29, 2023, 17:13:11

Index historical time-series data into an Elasticsearch data stream - ILM

Question

My use case is the following: I have continuously produced time-series data plus one year of history. I want to index all of it into Elasticsearch in such a way that data is deleted after one year (according to the @timestamp field).

Data streams seem to be the perfect solution for the newly produced time-series data. The documents are indexed as soon as they are created, and ILM will delete the associated backing indices at the right moment, one year later.

However, I'm stuck with the historical data. How can I index it so that it is deleted at the right time? Because rollover is based on the index age and not on the documents' @timestamp field, all the associated backing indices will also be deleted one year from now, even if they contain older data. In my use case, this typically means that the oldest historical data will remain in the cluster for two years, which is not the expected behaviour.

Do you have any ideas for fixing this?

Answer 1

Score: 1

You can override this behavior by providing your own index.lifecycle.origination_date:

> If specified, this is the timestamp used to calculate the index age for its phase transitions. Use this setting if you create a new index that contains old data and want to use the original creation date to calculate the index age. Specified as a Unix epoch value in milliseconds.

So you can index your old data into your data stream and then, for each backing index, set the origination date to the date on which that index would have been created if the historical data had been indexed at the time it was produced.

PUT .ds-index-xxx/_settings
{
   "index.lifecycle.origination_date": 1577836800000
}
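Since the setting is specified as a Unix epoch value in milliseconds, a calendar date has to be converted before it is sent. A minimal Python sketch of that conversion follows; the helper name to_epoch_millis is hypothetical, and it assumes that a value without a UTC offset is meant as UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(iso_value: str) -> int:
    """Convert an ISO 8601 date or datetime string to Unix epoch milliseconds.

    Assumption: values without a UTC offset are interpreted as UTC.
    """
    dt = datetime.fromisoformat(iso_value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2020-01-01"))  # 1577836800000
```

The printed number is the value that would go into index.lifecycle.origination_date for a backing index whose data should be treated as created on 2020-01-01 (UTC).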

You can find the max timestamp to use for each backing index using the following query:

POST index/_search
{
  "size": 0,
  "aggs": {
    "index": {
      "terms": {
        "field": "_index"
      },
      "aggs": {
        "date": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
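As a rough illustration of how the two requests could be tied together, the Python sketch below turns the response of the aggregation above into one settings body per backing index. The function name origination_settings and the sample response (index names and timestamps) are hypothetical; only the aggregation names "index" and "date" come from the query. It relies on the max aggregation over a date field reporting its value in epoch milliseconds.

```python
import json


def origination_settings(search_response: dict) -> dict:
    """Map each backing index name to a settings body holding its max @timestamp."""
    buckets = search_response["aggregations"]["index"]["buckets"]
    return {
        bucket["key"]: {
            "index.lifecycle.origination_date": int(bucket["date"]["value"])
        }
        for bucket in buckets
        # The max value is null when no document in the bucket has the field.
        if bucket["date"]["value"] is not None
    }


# Hypothetical response, trimmed to the fields the function reads.
sample_response = {
    "aggregations": {
        "index": {
            "buckets": [
                {
                    "key": ".ds-index-2023.06.29-000001",
                    "doc_count": 120,
                    "date": {"value": 1.5778368e12},
                },
                {
                    "key": ".ds-index-2023.06.29-000002",
                    "doc_count": 80,
                    "date": {"value": 1.5805152e12},
                },
            ]
        }
    }
}

for index_name, body in origination_settings(sample_response).items():
    print(f"PUT {index_name}/_settings")
    print(json.dumps(body, indent=2))
```

Note that a terms aggregation returns only the top 10 buckets by default, so with more backing indices the query would need a larger "size" inside the "terms" block.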
