Search thread_pool for particular nodes always at maximum

Question


I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know.) I am now seeing a lot of rejections from the search thread_pool.

This is my current search thread_pool:

node_name               name   active rejected  completed
1105-IDC.node          search      0 19295154 1741362188
1108-IDC.node          search      0  3362344 1660241184
1103-IDC.node          search     49 28763055 1695435484
1102-IDC.node          search      0  7715608 1734602881
1106-IDC.node          search      0 14484381 1840694326
1107-IDC.node          search     49 22470219 1641504395
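For reference, a table like this comes from `GET _cat/thread_pool/search?v&h=node_name,name,active,rejected,completed`. A quick way to rank the nodes is to parse that output and sort by the rejected column — a minimal sketch in Python, using the rows above as sample input:

```python
# Parse `_cat/thread_pool/search` output and rank nodes by rejections.
# The sample text below is copied from the table in this question; in
# practice you would fetch it from the cat API of a live node.
cat_output = """\
1105-IDC.node search  0 19295154 1741362188
1108-IDC.node search  0  3362344 1660241184
1103-IDC.node search 49 28763055 1695435484
1102-IDC.node search  0  7715608 1734602881
1106-IDC.node search  0 14484381 1840694326
1107-IDC.node search 49 22470219 1641504395"""

def rank_by_rejections(text):
    rows = []
    for line in text.splitlines():
        node, name, active, rejected, completed = line.split()
        rows.append((node, int(active), int(rejected), int(completed)))
    # Sort by rejected count, highest first.
    return sorted(rows, key=lambda r: r[2], reverse=True)

for node, active, rejected, completed in rank_by_rejections(cat_output):
    # Rejection ratio relative to all search requests the node has handled.
    rate = rejected / (rejected + completed)
    print(f"{node}: active={active} rejected={rejected} ({rate:.2%})")
```

Sorting this way confirms that 1103-IDC.node and 1107-IDC.node sit at the top of the rejection counts.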

Something I have noticed is that two nodes always have the maximum number of active threads (1103-IDC.node and 1107-IDC.node). Even though the other nodes also have rejections, these two have the most. Their hardware is similar to the other nodes'. What could be the reason for this? Could it be that they hold particular shards that receive more hits? If so, how do I find those shards?

Also, young-generation GC pauses are taking more than 70 ms (sometimes around 200 ms) on the nodes where the active thread count is always at maximum. Below are some lines from the GC log of one of those nodes:

[2020-10-27T04:32:14.380+0000][53678][gc             ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start       ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc             ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start       ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc             ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start       ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc             ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start       ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc             ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start       ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc             ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start       ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc             ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start       ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc             ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
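To quantify pauses like these, the durations can be pulled out of the log with a small script — a sketch assuming the unified JVM GC log format shown above (`Pause Young ... <n>ms`):

```python
import re

# Matches the pause duration at the end of unified-GC-log lines such as:
# [...][gc] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
# "gc,start" lines carry no duration and are skipped automatically.
PAUSE_RE = re.compile(r"Pause Young .*?([\d.]+)ms$")

def pause_stats(log_lines):
    pauses = [float(m.group(1))
              for line in log_lines
              if (m := PAUSE_RE.search(line))]
    return {
        "count": len(pauses),
        "avg_ms": sum(pauses) / len(pauses),
        "max_ms": max(pauses),
    }

# Two lines taken from the log excerpt above:
sample = [
    "[2020-10-27T04:32:14.380+0000][53678][gc] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms",
    "[2020-10-27T04:33:29.740+0000][53678][gc] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms",
]
print(pause_stats(sample))
```

Running this over a longer window makes it easy to compare average and worst-case young-GC pauses between the hot nodes and the quiet ones.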

Answer 1

Score: 1


One important thing to note is that if you got these stats from the Elasticsearch thread_pool cat API, they are only point-in-time data; the API does not show historical data for, say, the last hour, 6 hours, day, or week.

Also, the rejected and completed counts are cumulative since each node's last restart, so they are not very helpful either when trying to figure out whether some ES nodes are becoming hotspots due to a bad or unbalanced shard configuration.

So there are two very important things to figure out here:

  1. Identify the actual hotspot nodes in the cluster by looking at the average active and rejected requests on the data nodes over a time range (you can check just the peak hours).
  2. Once the hotspot nodes are known, look at the shards allocated to them and compare them with the shards on the other nodes. A few metrics to check: the number of shards, which shards receive more traffic, which shards receive the slowest queries, and so on. Again, most of these you have to figure out by looking at various Elasticsearch metrics and APIs, which can be very time-consuming and requires a lot of internal ES knowledge.
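Point 1 above can be sketched as follows: since `rejected` is cumulative, take periodic snapshots and diff successive ones to get per-interval rejection counts. A minimal Python sketch — the snapshot values below are hypothetical; in practice each snapshot would come from the `_cat/thread_pool/search` API:

```python
# Turn cumulative `rejected` counters into per-interval deltas, so hotspot
# nodes show up over a chosen time range rather than "since last restart".

def rejection_deltas(previous, current):
    """Per-node rejections that occurred between two snapshots.

    Each snapshot maps node_name -> cumulative rejected count."""
    return {node: current[node] - previous.get(node, 0)
            for node in current}

# Two hypothetical snapshots taken, say, five minutes apart:
t0 = {"1103-IDC.node": 28763055, "1107-IDC.node": 22470219, "1105-IDC.node": 19295154}
t1 = {"1103-IDC.node": 28790000, "1107-IDC.node": 22480000, "1105-IDC.node": 19295160}

deltas = rejection_deltas(t0, t1)
hottest = max(deltas, key=deltas.get)
print(deltas, hottest)
```

Collecting these deltas on a schedule (e.g. via cron or a monitoring agent) gives the per-time-range view of rejections that the raw cat API output cannot.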

huangapple
  • Published on 2020-10-27 12:34:34
  • Please retain this link when reposting: https://go.coder-hub.com/64548363.html