Rejections from search thread_pool Elasticsearch
Question
I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know.) Now I am seeing a lot of rejections from the search thread_pool.
[2020-10-26T12:50:32,730][DEBUG][o.e.a.s.TransportSearchAction] [IDC.node1.com] [indice_1][3], node[M4pYcHhzSTIj5NXA], [P], s[STARTED], a[id=IAQaCGTCTA-CkKGPw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[indice_1], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[_doc], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [IDC.node1.com][node1:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@21060fcc on QueueResizingEsThreadPoolExecutor[name = IDC.node1.com/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.2s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@3c46a309[Running, pool size = 49, active threads = 49, queued tasks = 1063, completed tasks = 1693218669]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-7.4.2.jar:7.4.2]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:84) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$8(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.shard.IndexShard.awaitShardSearchActive(IndexShard.java:3247) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$9(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:114) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:87) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.rewriteShardRequest(SearchService.java:1048) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$6(SearchTransportService.java:335) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
I can also "Pause Young (Allocation Failure)" happening constantly in gc log.
[2020-10-26T07:31:06.514+0000][61585][gc,start ] GC(7139651) Pause Young (Allocation Failure)
[2020-10-26T07:31:06.514+0000][61585][gc,task ] GC(7139651) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 1: 40802096 bytes, 40802096 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 2: 16714352 bytes, 57516448 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 3: 8509976 bytes, 66026424 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 4: 6906208 bytes, 72932632 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 5: 4186344 bytes, 77118976 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 6: 1960288 bytes, 79079264 total
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) ParNew: 1631431K->102498K(1763584K)
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) CMS: 35964431K->35966764K(50469312K)
[2020-10-26T07:31:06.574+0000][61585][gc,metaspace ] GC(7139651) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:06.574+0000][61585][gc ] GC(7139651) Pause Young (Allocation Failure) 36714M->35223M(51008M) 59.928ms
[2020-10-26T07:31:06.574+0000][61585][gc,cpu ] GC(7139651) User=1.02s Sys=0.00s Real=0.06s
[2020-10-26T07:31:06.574+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 675463951 ns, Reaching safepoint: 1382982 ns, At safepoint: 60208788 ns, Total: 61591770 ns
[2020-10-26T07:31:07.575+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000129984 ns, Reaching safepoint: 901989 ns, At safepoint: 39274 ns, Total: 941263 ns
[2020-10-26T07:31:08.529+0000][61585][gc,start ] GC(7139652) Pause Young (Allocation Failure)
[2020-10-26T07:31:08.529+0000][61585][gc,task ] GC(7139652) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) - age 1: 17722072 bytes, 17722072 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 2: 15204416 bytes, 32926488 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 3: 16712976 bytes, 49639464 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 4: 8509664 bytes, 58149128 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 5: 6501424 bytes, 64650552 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 6: 3047280 bytes, 67697832 total
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) ParNew: 1670178K->89020K(1763584K)
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) CMS: 35966764K->35968680K(50469312K)
[2020-10-26T07:31:08.585+0000][61585][gc,metaspace ] GC(7139652) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:08.585+0000][61585][gc ] GC(7139652) Pause Young (Allocation Failure) 36754M->35212M(51008M) 56.177ms
[2020-10-26T07:31:08.585+0000][61585][gc,cpu ] GC(7139652) User=0.87s Sys=0.00s Real=0.05s
[2020-10-26T07:31:08.585+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 948848143 ns, Reaching safepoint: 4487353 ns, At safepoint: 56505105 ns, Total: 60992458 ns
[2020-10-26T07:31:09.586+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000188782 ns, Reaching safepoint: 807865 ns, At safepoint: 31141 ns, Total: 839006 ns
[2020-10-26T07:31:10.587+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000218955 ns, Reaching safepoint: 947598 ns, At safepoint: 40468 ns, Total: 988066 ns
[2020-10-26T07:31:11.588+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000252002 ns, Reaching safepoint: 688820 ns, At safepoint: 24744 ns, Total: 713564 ns
[2020-10-26T07:31:12.590+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000168389 ns, Reaching safepoint: 978766 ns, At safepoint: 30805 ns, Total: 1009571 ns
[2020-10-26T07:31:13.497+0000][61585][gc,start ] GC(7139653) Pause Young (Allocation Failure)
[2020-10-26T07:31:13.498+0000][61585][gc,task ] GC(7139653) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 1: 11416792 bytes, 11416792 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 2: 4119704 bytes, 15536496 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 3: 15256408 bytes, 30792904 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 4: 17073416 bytes, 47866320 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 5: 8509616 bytes, 56375936 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 6: 5802544 bytes, 62178480 total
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) ParNew: 1656700K->91797K(1763584K)
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) CMS: 35968680K->35971272K(50469312K)
[2020-10-26T07:31:13.560+0000][61585][gc,metaspace ] GC(7139653) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:13.561+0000][61585][gc ] GC(7139653) Pause Young (Allocation Failure) 36743M->35217M(51008M) 63.001ms
[2020-10-26T07:31:13.561+0000][61585][gc,cpu ] GC(7139653) User=0.99s Sys=0.00s Real=0.07s
[2020-10-26T07:31:13.561+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 906329704 ns, Reaching safepoint: 1247593 ns, At safepoint: 63668517 ns, Total: 64916110 ns
Should this happen this frequently? Below are the heap pool settings:
"pools" : {
"young" : {
"used_in_bytes" : 731499608,
"max_in_bytes" : 1605304320,
"peak_used_in_bytes" : 1605304320,
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used_in_bytes" : 187117000,
"max_in_bytes" : 200605696,
"peak_used_in_bytes" : 200605696,
"peak_max_in_bytes" : 200605696
},
"old" : {
"used_in_bytes" : 16461553480,
"max_in_bytes" : 51680575488,
"peak_used_in_bytes" : 50144794688,
"peak_max_in_bytes" : 51680575488
}
}
Are the rejections because of the heap size, or because the thread_pool is too small? And why is "Pause Young" happening this rapidly?
Update
This update was asked as another question: https://stackoverflow.com/questions/64548363/search-thread-pool-for-particular-nodes-always-at-maximum
Answer 1
Score: 3
The search thread_pool has an associated queue from which the search worker threads pick up work. Its default capacity is 1000 (note that this is per node), and in your case it is being exhausted; please refer to the thread pools documentation in ES for more info.
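To confirm which nodes are rejecting and how full their queues are, the _cat thread_pool API can be queried per node. A minimal sketch (the host is an assumption, and the column list is just a convenient selection; add authentication if your cluster requires it):

# Per-node search thread_pool statistics: active threads, queue depth and rejection count
curl -s "http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed"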
Now, the queue capacity can be exhausted for one of the reasons below:
- A high search request rate on your ES nodes, which they might not be able to keep up with.
- Slow queries that take a lot of time to process, causing other requests to wait in the search queue and ultimately leading to rejections.
- Nodes and shards that are not properly balanced across your cluster, which can cause a few nodes to receive a disproportionate share of the search queries (a quick balance check is sketched below).
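For the balance point, comparing shard counts and disk usage per node is usually enough to spot a skew. A sketch using the standard _cat APIs (the host is an assumption; indice_1 is the index name from the question's logs):

# Shards and disk usage per node – a node holding far more shards than the others is a likely hot spot
curl -s "http://localhost:9200/_cat/allocation?v"
# Where the shards of the affected index are located
curl -s "http://localhost:9200/_cat/shards/indice_1?v"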
You should look at the search slow logs to identify the costly queries that were running during the issue.
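If slow logging is not yet enabled on the affected index, it can be turned on dynamically. A sketch with example thresholds (indice_1 is again the index from the question's logs; the threshold values are arbitrary and should be tuned to your own latency expectations):

# Enable search slow logging on the index with example thresholds
curl -s -X PUT "http://localhost:9200/indice_1/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'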
And if it's a legitimate increase in search traffic, you should scale your cluster to handle more search requests.
Regarding the young-generation GC logs: as long as the pauses stay under 50-70 ms, they are not that big a concern. You should focus on fixing the search rejections, which may resolve or at least reduce the heap pressure, and in any case you should keep the max heap size below 32 GB for better performance.
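As a concrete example, the heap is configured per node in jvm.options (or in a file under jvm.options.d on ES 7.7+). Staying just below the compressed-oops cutoff would look roughly like this; 31g is a commonly used safe value, not something taken from the original answer:

# jvm.options – keep min and max heap equal and below ~32 GB so compressed ordinary object pointers stay enabled
-Xms31g
-Xmx31g

Elasticsearch prints "compressed ordinary object pointers [true]" in its startup log, which can be used to verify that the new heap size still benefits from compressed oops.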