Encountered too many errors talking to a worker node with Trino
Question
Facing this issue while running a single large query. Can we kill such queries before this error occurs?
io.trino.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://172.22.66.206:8889/v1/task/20230727_083615_00032_edi7s.0.0.0/results/0/0 - 30 failures, failure duration 302.86s, total failed request time 312.86s)
3-node cluster of m6g.16xlarge instances (1 coordinator and 2 workers)
node-scheduler.include-coordinator=false
discovery.uri=http://ip-172-22-69-150.ec2.internal:8889
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=3000GB
query.max-memory-per-node=60GB
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8889
http-server.log.path=/var/log/trino/http-request.log
http-server.log.max-size=67108864B
http-server.log.max-history=5
log.max-size=268435456B
jmx.rmiregistry.port = 9080
jmx.rmiserver.port = 9081
node-scheduler.max-splits-per-node = 200
experimental.query-max-spill-per-node = 50GB
graceful-shutdown-timeout = 3600s
task.concurrency = 16
query.execution-policy = phased
experimental.max-spill-per-node = 100GB
query.max-concurrent-queries = 20
query.max-total-memory = 5000GB
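On killing such queries before the transport error fires: Trino supports per-query runtime limits that fail a query once it exceeds a configured duration. A minimal sketch, assuming current Trino property names (the 30m limit is illustrative, not a recommendation); note also that the ~300s failure window in the error message appears to match the default exchange error limit (exchange.max-error-duration, default 5m) in recent Trino versions:

# config.properties on the coordinator: fail any query running longer than 30 minutes
query.max-execution-time=30m

-- or per session from a client, without a cluster-wide change:
SET SESSION query_max_execution_time = '30m';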
Answer 1
Score: 0
I had the following flags in the JVM config (jvm.config):
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
Because of this, whenever an OOM occurred, the process would dump its heap to disk and then be killed. The heap dumps pushed the worker node's disk utilization too high (>95%), which prevented the Trino process from starting again.
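One way to keep heap dumps from filling the volume Trino runs on is to redirect them with the standard HotSpot flag below; the path is an assumption and should point at a volume with spare capacity:

-XX:HeapDumpPath=/mnt/heapdumps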
After looking around, I figured out the OOM was caused by this JDK bug: https://bugs.openjdk.org/browse/JDK-8293861
To fix it, I added the following JVM properties:
-XX:+UnlockDiagnosticVMOptions
-XX:-G1UsePreventiveGC
This prevents the process from going OOM due to G1's preventive GC.
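For reference, the combined jvm.config entries discussed above might look like this (a sketch: the HeapDumpPath value is an assumption, and -XX:-G1UsePreventiveGC only applies on JDK releases that shipped preventive GC, which was disabled by default and later removed upstream):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/mnt/heapdumps
-XX:OnOutOfMemoryError=kill -9 %p
-XX:+UnlockDiagnosticVMOptions
-XX:-G1UsePreventiveGC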