Index historical time-series data into an Elasticsearch data stream - ILM

Question

My use case is the following: I have continuously produced time-series data plus one year of history. I want to index all of it into Elasticsearch in such a way that data is deleted after one year (according to the @timestamp field).

Data streams seem to be the perfect solution for the newly produced time-series data. Documents get indexed as soon as they are created, and ILM will delete the associated backing indices at the right moment, one year later.

However, I'm stuck with the historical data. How can I index it in such a way that it will be deleted at the right time? As the rollover is based on the index age and not on the documents' @timestamp fields, all associated backing indices will also be deleted one year after creation, even if they contain older data. In my use case, this typically means that the oldest historical data will remain in the cluster for two years, which is not the expected behaviour.

Do you have any ideas on how to fix this?

Answer 1

Score: 1

You can override this behavior by providing your own index.lifecycle.origination_date setting:

> If specified, this is the timestamp used to calculate the index age for its phase transitions. Use this setting if you create a new index that contains old data and want to use the original creation date to calculate the index age. Specified as a Unix epoch value in milliseconds.

So you can index your old data into your data stream and then, for each backing index, set the origination date to the date on which that index would have been created if the historical data had been indexed at the time. The value must be given in epoch milliseconds; for example, 1577836800000 corresponds to 2020-01-01T00:00:00Z:

PUT .ds-index-xxx/_settings
{
  "index.lifecycle.origination_date": 1577836800000
}
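
Since the setting is documented as a Unix epoch value in milliseconds, a small helper can convert a calendar date into that form and build the settings body. This is only a minimal Python sketch: the function names `to_epoch_millis` and `origination_settings` are made up for illustration, and dates without a UTC offset are assumed to be UTC.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(iso_date: str) -> int:
    """Convert an ISO 8601 date or datetime to Unix epoch milliseconds."""
    dt = datetime.fromisoformat(iso_date)
    if dt.tzinfo is None:
        # Assumption: values without an offset are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def origination_settings(iso_date: str) -> str:
    """Build the JSON body for PUT <backing-index>/_settings."""
    return json.dumps({"index.lifecycle.origination_date": to_epoch_millis(iso_date)})


print(origination_settings("2020-01-01"))
# prints {"index.lifecycle.origination_date": 1577836800000}
```

The printed body can then be sent with the PUT request against the relevant backing index.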

You can find the max timestamp to use for each backing index using the following query:

POST index/_search
{
  "size": 0,
  "aggs": {
    "index": {
      "terms": {
        "field": "_index"
      },
      "aggs": {
        "date": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
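
The response to this aggregation can be turned into one origination date per backing index. Below is a minimal Python sketch, assuming the usual response shape for a terms aggregation with a nested max aggregation (`aggregations.index.buckets[].date.value`, in epoch milliseconds). The function name `origination_dates` and the sample index names are hypothetical.

```python
from typing import Any, Dict


def origination_dates(response: Dict[str, Any]) -> Dict[str, int]:
    """Map each backing index name to its max @timestamp in epoch milliseconds."""
    buckets = response["aggregations"]["index"]["buckets"]
    return {
        bucket["key"]: int(bucket["date"]["value"])
        for bucket in buckets
        # Skip indices whose max is null (no documents with @timestamp).
        if bucket["date"]["value"] is not None
    }


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    "aggregations": {
        "index": {
            "buckets": [
                {"key": ".ds-index-000001", "doc_count": 120,
                 "date": {"value": 1577836800000.0}},
                {"key": ".ds-index-000002", "doc_count": 80,
                 "date": {"value": 1580515200000.0}},
            ]
        }
    }
}

for index_name, millis in origination_dates(sample_response).items():
    print(f"PUT {index_name}/_settings ->", {"index.lifecycle.origination_date": millis})
```

Note that a terms aggregation returns only the top 10 buckets by default, so with more than ten backing indices a larger "size" would be needed inside the "terms" clause.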

huangapple
  • Posted on June 29, 2023, 17:13:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76579682.html