spark number of executors when dynamic allocation is enabled
Question
I have an r5.8xlarge AWS cluster with 12 nodes, so there are 6144 cores (12 nodes * 32 vCPU * 16 cores). I have set --executor-cores=5 and enabled dynamic allocation using the spark-submit command below. Even after setting --conf spark.dynamicAllocation.initialExecutors=150 --conf spark.dynamicAllocation.minExecutors=150, I'm only seeing 70 executors in the Spark UI for the application. What am I doing wrong?
r5.8xlarge clusters have 256GB per node, so 3072GB in total (256GB * 12 nodes).
FYI - I'm not including the driver node in this calculation.
--driver-memory 200G --deploy-mode client --executor-memory 37G --executor-cores 7 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.driver.maxResultSize=0 --conf spark.sql.shuffle.partitions=2000 --conf spark.dynamicAllocation.initialExecutors=150 --conf spark.dynamicAllocation.minExecutors=150
Answer 1
Score: 1
You have 256GB per node and 37G per executor. An executor can only run on a single node (an executor cannot be shared between multiple nodes), so each node can hold at most 6 executors (256 / 37 = 6). Since you have 12 nodes, the maximum number of executors is 6 * 12 = 72, which explains why you only see 70 executors in your Spark UI (the difference of 2 executors is likely due to the memory allocated to the driver or to memory-allocation overhead on some nodes).
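As a sanity check, the packing arithmetic can be reproduced with a few lines of Python. This is a minimal sketch that models memory only; it ignores spark.executor.memoryOverhead and YARN/OS reservations, which is likely where the missing 2 executors go:

# How many 37G executors fit on a 256GB node, and across 12 nodes?
nodes = 12
node_mem_gb = 256
executor_mem_gb = 37

per_node = node_mem_gb // executor_mem_gb                    # 256 // 37 = 6
total = per_node * nodes                                     # 6 * 12 = 72
unused_per_node = node_mem_gb - per_node * executor_mem_gb   # 256 - 222 = 34
print(per_node, total, unused_per_node)                      # 6 72 34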
If you want more executors, you have to decrease the executor memory. Also, to fully utilize your cluster, make sure the remainder of the node memory divided by the executor memory is as close to zero as possible. For example:
- 256GB per node and 37G per executor: 256 / 37 ≈ 6.9 => 6 executors per node (34G lost per node)
- 256GB per node and 36G per executor: 256 / 36 ≈ 7.1 => 7 executors per node (only 4G lost per node, so you reclaim 30G of usable memory per node compared to the 37G setting)
If you want at least 150 executors, then the executor memory should be at most 19G.
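The 19G figure follows from the same model: work out how many executors each node must hold to reach the target, then take the largest whole-gigabyte executor that still fits. A rough sketch, again ignoring memory overhead:

import math

# Largest executor memory that still allows 150 executors on 12 x 256GB nodes.
nodes = 12
node_mem_gb = 256
target_executors = 150

per_node_needed = math.ceil(target_executors / nodes)   # ceil(150 / 12) = 13
max_executor_mem_gb = node_mem_gb // per_node_needed     # 256 // 13 = 19
print(per_node_needed, max_executor_mem_gb)              # 13 19

Applied to your submit command, that means lowering --executor-memory from 37G to at most 19G (or to 36G if 7 executors per node, 84 in total, is enough).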