Improve performance for last values aggregation on large existing dataset
Question
I am running aggregations on a large index (several terabytes, split into 100 GB sub-indices).
The query retrieves the most recent document for each value of the "tag" field.
My problem is that the query is very slow (> 1 minute), and I have to run it several times with different (group) filters.
Example:
{
"size": 0,
"query": {"bool": {
"must": [
{"match": {
"group.keyword": "ZZZ"
}},
{
"range": {
"date": {
"lt": "2023-02-02T00:00:00+0100"
}
}
}
]}},
"aggs": {
"perTag": {
"terms": {
"field": "tag.keyword",
"size": 65000
},
"aggs": {
"theLastOfValues": {
"top_hits": {
"size": 1,
"sort": [{
"date": {"order": "desc"}
}]
}
}
}
}
}
}
I've found this post: https://stackoverflow.com/a/58838916/591922, which says that a date-sorted index could improve performance a lot.
But I've also found this in the documentation: "Sort in reindex is deprecated. Sorting in reindex was never guaranteed to index documents in order."
So if I can't reindex old documents, is there a way to apply a sort policy when indexing new documents?
If so, would it be efficient, and would it work, to look only at the recently inserted documents (those with recent dates) when looking for the most recent document for each tag value?
Thanks
Answer 1
Score: 2
You should try the composite aggregation, which lets each request work on a small page of buckets while still letting you paginate over all your tags efficiently, without having to retrieve all 65K tags in one go:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"group.keyword": "ZZZ"
}
},
{
"range": {
"date": {
"lt": "2023-02-02T00:00:00+0100"
}
}
}
]
}
},
"aggs": {
"tags": {
"composite": {
"size": 100,
"sources": [
{
"perTag": {
"terms": {
"field": "tag.keyword"
}
}
}
]
},
"aggs": {
"theLastOfValues": {
"top_hits": {
"size": 1,
"sort": [
{
"date": {
"order": "desc"
}
}
]
}
}
}
}
}
}
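To fetch the remaining pages, each follow-up request passes the `after_key` returned by the previous response as the `after` parameter of the composite aggregation. Below is a minimal Python sketch of that loop, not a definitive implementation. The helper names `build_body` and `iter_latest_per_tag` are hypothetical, and `search` is assumed to be any callable that sends the body to the `_search` endpoint and returns the parsed JSON response (for example a thin wrapper around whichever client you use).

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple


def build_body(group: str, before: str, page_size: int = 100,
               after_key: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the request body from the answer, optionally resuming after a key."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"perTag": {"terms": {"field": "tag.keyword"}}}],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {"bool": {"must": [
            {"match": {"group.keyword": group}},
            {"range": {"date": {"lt": before}}},
        ]}},
        "aggs": {"tags": {
            "composite": composite,
            "aggs": {"theLastOfValues": {"top_hits": {
                "size": 1,
                "sort": [{"date": {"order": "desc"}}],
            }}},
        }},
    }


def iter_latest_per_tag(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                        group: str, before: str,
                        page_size: int = 100) -> Iterator[Tuple[Any, Dict[str, Any]]]:
    """Yield (tag, latest_hit) pairs, one page of buckets per request."""
    after_key: Optional[Dict[str, Any]] = None
    while True:
        response = search(build_body(group, before, page_size, after_key))
        agg = response["aggregations"]["tags"]
        for bucket in agg["buckets"]:
            hits = bucket["theLastOfValues"]["hits"]["hits"]
            if hits:
                yield bucket["key"]["perTag"], hits[0]
        # after_key is absent once there are no more buckets to page through.
        after_key = agg.get("after_key")
        if after_key is None or not agg["buckets"]:
            break
```

Injecting `search` as a callable keeps the sketch independent of any particular client library or version; the page size of 100 simply mirrors the request above.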
PS: The composite aggregation was created exactly for this purpose; it is the aggregation that powers the Latest Transform API.
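For reference, a latest transform keeps one document per unique key in a destination index, so the "last value per tag" can be read from a small index instead of being recomputed on every query. The sketch below only builds a request body that would be sent to `PUT _transform/<transform_id>`; the helper name and the index names are hypothetical, and whether a transform fits your cluster version and workload is an assumption to verify against the documentation.

```python
from typing import Any, Dict


def build_latest_transform(source_index: str, dest_index: str,
                           group: str) -> Dict[str, Any]:
    """Body for a latest transform: newest document per tag for one group."""
    return {
        "source": {
            "index": source_index,
            # Same group filter as in the search request above.
            "query": {"match": {"group.keyword": group}},
        },
        "dest": {"index": dest_index},
        "latest": {
            # One output document per distinct tag value...
            "unique_key": ["tag.keyword"],
            # ...chosen as the one with the greatest value of this date field.
            "sort": "date",
        },
    }


# Hypothetical index names, for illustration only.
body = build_latest_transform("my-source-index", "latest-per-tag", "ZZZ")
```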