Question

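I am running aggregations on a large index (several terabytes, split into 100 GB sub-indices).
The query retrieves the most recent document for each value of the "tag" field.
My problem is that the query is very slow (more than 1 minute), and I have to run it several times with different (group) filters.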

Example:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {"match": {"group.keyword": "ZZZ"}},
        {
          "range": {
            "date": {
              "lt": "2023-02-02T00:00:00+0100"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "perTag": {
      "terms": {
        "field": "tag.keyword",
        "size": 65000
      },
      "aggs": {
        "theLastOfValues": {
          "top_hits": {
            "size": 1,
            "sort": [{"date": {"order": "desc"}}]
          }
        }
      }
    }
  }
}

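I found this post: https://stackoverflow.com/a/58838916/591922, which says that a date-sorted index could improve performance a lot.

But I also found this in the documentation: "Sort in reindex is deprecated. Sorting in reindex was never guaranteed to index documents in order."

So if I can't reindex the old documents, is there a way to apply a sort policy when indexing new documents?

If so, would it work, and be efficient, to look only at the recently inserted documents (those with recent dates) when I am looking for the most recent document for each tag value?

Thanks.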


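Answer 1

Score: 2

You should try the composite aggregation, which lets each request work on a small page of buckets and lets you paginate over all your tags more efficiently, without having to retrieve all 65K tags in one go: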

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "group.keyword": "ZZZ"
          }
        },
        {
          "range": {
            "date": {
              "lt": "2023-02-02T00:00:00+0100"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "tags": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "perTag": {
              "terms": {
                "field": "tag.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "theLastOfValues": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

PS: It was created exactly for this purpose, and it is the aggregation that powers the Latest Transform API.
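
To walk through every tag with the query above, each request after the first passes the previous response's after_key back as the composite aggregation's "after" parameter. Below is a minimal Python sketch of that loop, not a definitive implementation. The names build_body and iter_latest_per_tag are hypothetical, and search is assumed to be any callable you supply that sends a request body to your cluster and returns the parsed JSON response as a dict.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple


def build_body(group: str, before: str, page_size: int = 100,
               after_key: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the request body from the answer, optionally resuming after a key."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"perTag": {"terms": {"field": "tag.keyword"}}}],
    }
    if after_key is not None:
        # Resume from the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {"bool": {"must": [
            {"match": {"group.keyword": group}},
            {"range": {"date": {"lt": before}}},
        ]}},
        "aggs": {"tags": {
            "composite": composite,
            "aggs": {"theLastOfValues": {"top_hits": {
                "size": 1,
                "sort": [{"date": {"order": "desc"}}],
            }}},
        }},
    }


def iter_latest_per_tag(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],  # assumed: body -> response dict
    group: str,
    before: str,
    page_size: int = 100,
) -> Iterator[Tuple[Any, Optional[Dict[str, Any]]]]:
    """Yield (tag, latest document source) pairs, one page of buckets at a time."""
    after_key = None
    while True:
        response = search(build_body(group, before, page_size, after_key))
        agg = response["aggregations"]["tags"]
        buckets = agg["buckets"]
        if not buckets:
            return  # an empty page means every tag has been seen
        for bucket in buckets:
            hits = bucket["theLastOfValues"]["hits"]["hits"]
            yield bucket["key"]["perTag"], (hits[0]["_source"] if hits else None)
        after_key = agg.get("after_key")
        if after_key is None:
            return
```

In practice, search would wrap your Elasticsearch client's search call for the relevant indices. Keeping it as a parameter means the loop does not depend on a particular client version and can be tested against a stub.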

