June 19, 2023 22:09:16

Elasticsearch Python query to get all logs from the last 24 hours when the count exceeds 10,000

Question

I am currently working on a project where I connect to an Elasticsearch server/database/cluster (whatever the correct technical term is), and my goal is to grab all the logs from the last 24 hours for parsing. I can grab logs right now, but only a maximum of 10,000. For reference, about 10 million logs have been written in the last 24 hours to the database I am using.

On the Python side, I make an HTTP request to Elasticsearch using the requests library. My current query has only one parameter, size = 10,000.

Which method or query should I use in this case? I have seen mentions of a scroll ID and the point in time API, but I am not sure which is best for my case, given how many logs there are.

I tried increasing the size well beyond that, but the request errors out because there are so many logs.

Answer 1

Score: 1

Use the scroll API; it was designed for your use case.

The scroll API is no longer recommended for deep pagination. However, if you are running an internal (logging) application, the performance impact of scroll shouldn't be an issue, as you will not have many queries to serve.

Many logging-focused Elasticsearch deployments use index lifecycle management (ILM) policies to create a new index each day (e.g. my-logs-2023-06-20), and logs are ingested into that index automatically. Once the day is over, the index is made read-only, and you can automatically migrate it to colder tiers with a reduced storage cost.
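As a rough illustration of the scroll approach, here is a minimal sketch that pages through every hit from the last 24 hours. It assumes a timestamp field named `@timestamp`, and the names `last_24h_query`, `scroll_hits`, `base_url`, and `index` are hypothetical and not taken from the question. The `session` argument is expected to behave like a `requests.Session`.

```python
from typing import Any, Dict, Iterator


def last_24h_query(timestamp_field: str = "@timestamp") -> Dict[str, Any]:
    """Range query for the last 24 hours (the field name is an assumption)."""
    return {"range": {timestamp_field: {"gte": "now-24h", "lte": "now"}}}


def scroll_hits(
    session: Any,
    base_url: str,
    index: str,
    query: Dict[str, Any],
    page_size: int = 5000,
    keep_alive: str = "2m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query`, one scroll page at a time."""
    # The first request opens the scroll context and returns the first page.
    resp = session.post(
        f"{base_url}/{index}/_search",
        params={"scroll": keep_alive},
        # Sorting by _doc skips scoring, which suits a full export.
        json={"size": page_size, "query": query, "sort": ["_doc"]},
    )
    resp.raise_for_status()
    body = resp.json()
    scroll_id = body.get("_scroll_id")
    try:
        # An empty page means the scroll is exhausted.
        while body["hits"]["hits"]:
            yield from body["hits"]["hits"]
            resp = session.post(
                f"{base_url}/_search/scroll",
                json={"scroll": keep_alive, "scroll_id": scroll_id},
            )
            resp.raise_for_status()
            body = resp.json()
            # The scroll ID may change between pages; always use the latest.
            scroll_id = body.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once finished.
        if scroll_id:
            session.delete(
                f"{base_url}/_search/scroll", json={"scroll_id": scroll_id}
            )
```

With the requests library this could be called as `scroll_hits(requests.Session(), "http://localhost:9200", "my-logs-*", last_24h_query())`, where the URL and index pattern are placeholders. Because it is a generator, the roughly 10 million hits can be parsed one at a time instead of being held in memory at once.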

Here's an example ILM policy you may want to consider.
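The policy the answer linked to is not reproduced here. Purely as an illustration of the idea, the sketch below builds a policy body that rolls over to a new index daily and then marks the old index read-only in the cold phase. It uses ILM rollover rather than date-named indices, and the function name `daily_logs_policy` and its default values are assumptions.

```python
from typing import Any, Dict


def daily_logs_policy(cold_after: str = "0d") -> Dict[str, Any]:
    """Build an ILM policy body: daily rollover, then a read-only cold phase."""
    return {
        "policy": {
            "phases": {
                # Roll over to a fresh index once the current one is a day old.
                "hot": {"actions": {"rollover": {"max_age": "1d"}}},
                # min_age is counted from rollover for rolled-over indices.
                "cold": {
                    "min_age": cold_after,
                    "actions": {"readonly": {}},
                },
            }
        }
    }
```

A body like this would be sent with `PUT <base_url>/_ilm/policy/<policy-name>`, for example `session.put(url, json=daily_logs_policy())` with requests. The policy name and any retention or delete phase are left for you to decide.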

If hundreds of indices sound like a nightmare, don't worry: you can create an alias so that you can query all-my-logs to search all the indices.
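One way such an alias might be created is through the `_aliases` endpoint. The sketch below assumes the daily indices match the pattern `my-logs-*`. The helper names `add_alias_body` and `create_alias` are hypothetical, and `session` is expected to behave like a `requests.Session`.

```python
from typing import Any, Dict


def add_alias_body(index_pattern: str, alias: str) -> Dict[str, Any]:
    """Build the request body that points `alias` at matching indices."""
    return {"actions": [{"add": {"index": index_pattern, "alias": alias}}]}


def create_alias(
    session: Any, base_url: str, index_pattern: str, alias: str
) -> Dict[str, Any]:
    """POST the alias action and return the parsed JSON response."""
    resp = session.post(
        f"{base_url}/_aliases", json=add_alias_body(index_pattern, alias)
    )
    resp.raise_for_status()
    return resp.json()
```

For example, `create_alias(requests.Session(), "http://localhost:9200", "my-logs-*", "all-my-logs")`, with a placeholder URL. A wildcard in an `add` action only covers indices that exist when the request runs, so indices created later usually get the alias through an index template instead.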
