Search thread_pool for particular nodes always at maximum

Question


I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know.) I am now seeing a lot of rejections from the search thread_pool.

This is my current search thread_pool:

node_name               name   active rejected  completed
1105-IDC.node          search      0 19295154 1741362188
1108-IDC.node          search      0  3362344 1660241184
1103-IDC.node          search     49 28763055 1695435484
1102-IDC.node          search      0  7715608 1734602881
1106-IDC.node          search      0 14484381 1840694326
1107-IDC.node          search     49 22470219 1641504395
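For reference, a table like this comes from `GET _cat/thread_pool/search?v&h=node_name,name,active,rejected,completed`. A quick way to rank the nodes is to parse that output and sort by the rejected column — a minimal sketch in Python, using the rows above as sample input:

```python
# Parse `_cat/thread_pool/search` output and rank nodes by rejections.
# The sample text below is copied from the table in this question; in
# practice you would fetch it from the cat API of a live node.
cat_output = """\
1105-IDC.node search  0 19295154 1741362188
1108-IDC.node search  0  3362344 1660241184
1103-IDC.node search 49 28763055 1695435484
1102-IDC.node search  0  7715608 1734602881
1106-IDC.node search  0 14484381 1840694326
1107-IDC.node search 49 22470219 1641504395"""

def rank_by_rejections(text):
    rows = []
    for line in text.splitlines():
        node, name, active, rejected, completed = line.split()
        rows.append((node, int(active), int(rejected), int(completed)))
    # Sort by rejected count, highest first.
    return sorted(rows, key=lambda r: r[2], reverse=True)

for node, active, rejected, completed in rank_by_rejections(cat_output):
    # Rejection ratio relative to all search requests the node has handled.
    rate = rejected / (rejected + completed)
    print(f"{node}: active={active} rejected={rejected} ({rate:.2%})")
```

Sorting this way confirms that 1103-IDC.node and 1107-IDC.node sit at the top of the rejection counts.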

Something I have noticed is that two nodes always have the maximum number of active threads (1103-IDC.node and 1107-IDC.node). Even though the other nodes also have rejections, these two have the most. Their hardware is similar to the other nodes'. What could be the reason for this? Could it be that they hold particular shards that receive more hits? If so, how do I find those shards?

Also, young-generation GC pauses are taking more than 70 ms (sometimes around 200 ms) on the nodes where the active thread count is always at maximum. Below are some lines from the GC log of one of those nodes:

[2020-10-27T04:32:14.380+0000][53678][gc             ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start       ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc             ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start       ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc             ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start       ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc             ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start       ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc             ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start       ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc             ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start       ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc             ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start       ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc             ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
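To quantify pauses like these, the durations can be pulled out of the log with a small script — a sketch assuming the unified JVM GC log format shown above (`Pause Young ... <n>ms`):

```python
import re

# Matches the pause duration at the end of unified-GC-log lines such as:
# [...][gc] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
# "gc,start" lines carry no duration and are skipped automatically.
PAUSE_RE = re.compile(r"Pause Young .*?([\d.]+)ms$")

def pause_stats(log_lines):
    pauses = [float(m.group(1))
              for line in log_lines
              if (m := PAUSE_RE.search(line))]
    return {
        "count": len(pauses),
        "avg_ms": sum(pauses) / len(pauses),
        "max_ms": max(pauses),
    }

# Two lines taken from the log excerpt above:
sample = [
    "[2020-10-27T04:32:14.380+0000][53678][gc] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms",
    "[2020-10-27T04:33:29.740+0000][53678][gc] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms",
]
print(pause_stats(sample))
```

Running this over a longer window makes it easy to compare average and worst-case young-GC pauses between the hot nodes and the quiet ones.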

Answer 1

Score: 1


One important thing to note is that if you got these stats from the Elasticsearch thread_pool cat API, they are only point-in-time data; the API does not show historical data for, say, the last hour, 6 hours, day, or week.

Also, the rejected and completed counts are cumulative since each node's last restart, so they are not very helpful either when trying to figure out whether some ES nodes are becoming hotspots due to a bad or unbalanced shard configuration.

So there are two very important things to figure out here:

  1. Identify the actual hotspot nodes in the cluster by looking at the average active and rejected requests on the data nodes over a time range (you can check just the peak hours).
  2. Once the hotspot nodes are known, look at the shards allocated to them and compare them with the shards on the other nodes. A few metrics to check: the number of shards, which shards receive more traffic, which shards receive the slowest queries, and so on. Again, most of these you have to figure out by looking at various Elasticsearch metrics and APIs, which can be very time-consuming and requires a lot of internal ES knowledge.
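Point 1 above can be sketched as follows: since `rejected` is cumulative, take periodic snapshots and diff successive ones to get per-interval rejection counts. A minimal Python sketch — the snapshot values below are hypothetical; in practice each snapshot would come from the `_cat/thread_pool/search` API:

```python
# Turn cumulative `rejected` counters into per-interval deltas, so hotspot
# nodes show up over a chosen time range rather than "since last restart".

def rejection_deltas(previous, current):
    """Per-node rejections that occurred between two snapshots.

    Each snapshot maps node_name -> cumulative rejected count."""
    return {node: current[node] - previous.get(node, 0)
            for node in current}

# Two hypothetical snapshots taken, say, five minutes apart:
t0 = {"1103-IDC.node": 28763055, "1107-IDC.node": 22470219, "1105-IDC.node": 19295154}
t1 = {"1103-IDC.node": 28790000, "1107-IDC.node": 22480000, "1105-IDC.node": 19295160}

deltas = rejection_deltas(t0, t1)
hottest = max(deltas, key=deltas.get)
print(deltas, hottest)
```

Collecting these deltas on a schedule (e.g. via cron or a monitoring agent) gives the per-time-range view of rejections that the raw cat API output cannot.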

huangapple
  • Published on 2020-10-27 12:34:34
  • Please retain this link when reposting: https://go.coder-hub.com/64548363.html