Improve performance for last values aggregation on large existing dataset

Question

I am running aggregations on a large index (several terabytes in total, split into 100 GB sub-indices).
The query retrieves the most recent document for each value of the "tag" field.
My problem is that the query is very slow (more than 1 minute), and I have to run it several times with different (group) filters.

Example:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {"match": {"group.keyword": "ZZZ"}},
        {
          "range": {
            "date": {
              "lt": "2023-02-02T00:00:00+0100"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "perTag": {
      "terms": {
        "field": "tag.keyword",
        "size": 65000
      },
      "aggs": {
        "theLastOfValues": {
          "top_hits": {
            "size": 1,
            "sort": [{
              "date": {"order": "desc"}
            }]
          }
        }
      }
    }
  }
}

I've found this post: https://stackoverflow.com/a/58838916/591922, which says that a date-sorted index could improve performance a lot.
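
For context: index sorting is declared through the static `index.sort.*` settings at index creation time, which is why it cannot simply be switched on for an index that already exists. A minimal sketch of such a creation body as a Python dict, assuming a brand-new index and that `date` is mapped as a date field; the helper name `build_sorted_index_body` is hypothetical, and whether sorting actually speeds up this particular aggregation is not something the sketch demonstrates:

```python
from typing import Any, Dict


def build_sorted_index_body(sort_field: str = "date",
                            sort_order: str = "desc") -> Dict[str, Any]:
    """Return an index-creation body that sorts segments by `sort_field`.

    Assumption: the body is sent when the index is created, because
    index.sort.* settings are static and cannot be changed afterwards.
    """
    return {
        "settings": {
            "index": {
                "sort.field": sort_field,
                "sort.order": sort_order,
            }
        },
        "mappings": {
            "properties": {
                # The sort field has to be mapped explicitly.
                sort_field: {"type": "date"},
            }
        },
    }
```

The returned dict would be used as the request body when creating each new sub-index; existing sub-indices keep their current, unsorted layout.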

But I've also found this in the documentation: "Sort in reindex is deprecated. Sorting in reindex was never guaranteed to index documents in order."

So if I can't reindex old documents, is there a way to apply a sort policy when indexing new documents?

If so, would it work, and be efficient, to look only at the recently inserted documents (those with recent dates) when I am looking for the most recent document for each tag value?

Thanks

Answer 1

Score: 2

You should try the composite aggregation, which lets each request work on a smaller set of buckets and lets you paginate over all your tags more efficiently, without having to retrieve all 65K tags in one go:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "group.keyword": "ZZZ"
          }
        },
        {
          "range": {
            "date": {
              "lt": "2023-02-02T00:00:00+0100"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "tags": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "perTag": {
              "terms": {
                "field": "tag.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "theLastOfValues": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

PS: The composite aggregation was created exactly for this purpose, and it is the aggregation that powers the Latest Transform API.
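
The pagination mentioned above works by feeding the `after_key` from one response into the `after` parameter of the next request. A minimal Python sketch of that loop, assuming a caller-supplied `search` callable (hypothetical, wrapping whichever client call you use) that takes a request body and returns the parsed response as a dict; the helper names are illustrative only:

```python
from typing import Any, Callable, Dict, Iterator, Optional


def build_page_request(after_key: Optional[Dict[str, Any]] = None,
                       page_size: int = 100) -> Dict[str, Any]:
    """Build the search body from the answer, optionally resuming after a key."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"perTag": {"terms": {"field": "tag.keyword"}}}],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {"bool": {"must": [
            {"match": {"group.keyword": "ZZZ"}},
            {"range": {"date": {"lt": "2023-02-02T00:00:00+0100"}}},
        ]}},
        "aggs": {"tags": {
            "composite": composite,
            "aggs": {"theLastOfValues": {"top_hits": {
                "size": 1,
                "sort": [{"date": {"order": "desc"}}],
            }}},
        }},
    }


def iter_latest_per_tag(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                        page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every composite bucket, requesting one page at a time."""
    after_key: Optional[Dict[str, Any]] = None
    while True:
        response = search(build_page_request(after_key, page_size))
        tags = response["aggregations"]["tags"]
        yield from tags["buckets"]
        after_key = tags.get("after_key")
        # Stop once a page is empty or no key is returned to resume from.
        if after_key is None or not tags["buckets"]:
            break
```

Each yielded bucket carries its `theLastOfValues` hits, so the caller can read the latest document per tag while only 100 buckets are held in a response at any time.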

huangapple
  • Published on February 23, 2023 at 21:34:13
  • If you repost, please keep this article's link: https://go.coder-hub.com/75545558.html