英文:
Why there is much time cost between took_millis and timeout in elasticsearch slow query
问题
我在生产环境中遇到了一些慢查询,并配置了慢日志来查找一些慢查询的信息,例如这样的查询字符串超时为500毫秒的情况:
[2021-06-21T10:43:33,930][DEBUG][index.search.slowlog.query.xxxx] [xxx][g][2] took[1s], took_millis[1043], total_hits[424690], types[top], stats[], search_type[QUERY_THEN_FETCH], total_shards[6], source[{query_string}]
在这种情况下,查询超时时间为500毫秒,响应中的took_millis为1043毫秒。
据我所知,超时时间只对查询解析有用,而took值表示es中的执行时间,不包括一些外部阶段,比如Query timing: 'took' value and what I'm measuring。我有两个问题:
- 首先,为什么在超时时间和took_millis之间有504毫秒的差距(1043 - 500 = 504)?
- 其次,我如何知道超时时间和took_millis之间的详细时间消耗?
非常感谢!
英文:
I meet some slow queries in the production environment and I config the slow log to find some slow queries info like this(query_string with 500ms timeout):
[2021-06-21T10:43:33,930][DEBUG][index.search.slowlog.query.xxxx] [xxx][g][2] took[1s], took_millis[1043], total_hits[424690], types[top], stats[], search_typ e[QUERY_THEN_FETCH], total_shards[6], source[{query_string}]
In this case, the query timeout is 500ms, and the took_millis in response is 1043ms.
As far as I know the timeout is only useful for the query parse and the took value represents the execution time in es without some external phases like Query timing: ‘took’ value and what I’m measuring. I have two questions:
- Firstly, why there is 504ms(1043 - 500 = 504) between the timeout and took_millis?
- Secondly, how can I know the detail time cost between the timeout and took_millis time?
Thanks a lot!
答案1
得分: 1
在查询上设置超时并不能确保当执行时间超过超时时间时查询会被取消。Elasticsearch文档中指出:
“默认情况下,运行中的搜索仅在段边界上检查是否取消,因此取消可能会被大段延迟。”
这可以解释超时时间和took_millis之间的504毫秒,你的查询需要那么长时间,并且没有及时取消。
为了分析查询的执行并查看可能导致这些长时间延迟的原因,你可以使用profile API重新运行查询。请注意,如果无法重现查询的慢执行,这将无法帮助你解决问题。如果你的查询大部分时间都运行良好,请尝试将这些运行缓慢的查询与服务器负载等外部因素相关联。
英文:
Setting a timeout on a query doesn't ensure that the query is actually cancelled when its execution time surpasses that timeout. The Elasticsearch documentation states:
> "By default, a running search only checks if it is cancelled or not on
> segment boundaries, therefore the cancellation can be delayed by large
> segments."
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search.html#global-search-timeout
Check issues 3627, 4586 and 2929
This can explain the 504ms between timeout and took_millis, your query just takes that long, and it is not cancelled in time.
To analyze the query execution and see what might be causing these long delays, you can rerun the query with the profile API. Note that if the slow execution of your query cannot be reproduced, this won't help you solve the issue. If your query runs fine most of the time, try to correlate these slow-running queries with external factors such as server load.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论