Spark: How much executor memory is available for application use?

Question

I am writing an algorithm that processes a "chunk" of data in memory. I'm using JavaPairRDD.groupByKey() to designate the chunks, but it is unclear to me how to calculate the optimal chunk size. The larger it is, the faster the algorithm will run. Given the chunk size, I can estimate my memory use, but how much executor memory is actually available to me (as opposed to claimed by Spark for its own use)? And is there any way to programmatically suggest to Spark that I have a memory-intensive step in the transformation chain?
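For context, here is a minimal sketch of the pattern being described (my own illustration, not code from the question; the chunking key and record values are hypothetical): keys designate the chunks, and groupByKey() materializes each chunk in memory on an executor.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ChunkSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "chunk-sketch");

        JavaRDD<String> records = sc.parallelize(Arrays.asList("a1", "a2", "b1", "b2"));

        // Hypothetical chunking key: here, the first character of each record.
        JavaPairRDD<String, String> keyed =
                records.mapToPair(r -> new Tuple2<>(r.substring(0, 1), r));

        // Each grouped Iterable<String> below is one in-memory "chunk"; its size is
        // what has to fit in the executor memory actually left to application code.
        keyed.groupByKey()
             .foreach(chunk -> System.out.println(chunk._1 + " -> " + chunk._2));

        sc.stop();
    }
}
```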
Answer 1

Score: 0
Never mind, this post explains it really well. You get

(HeapSize - ReservedMemory) * (1.0 - spark.memory.fraction)

which for a 4 GB heap is about 1500 MB, assuming default settings for the other parameters.
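To make the arithmetic concrete, here is a minimal sketch (my own illustration, not code from the answer) that plugs Spark's fixed 300 MB reserved memory and the default spark.memory.fraction of 0.6 into the formula above for a 4 GB executor heap.

```java
public class UserMemoryEstimate {
    public static void main(String[] args) {
        long heapBytes = 4L * 1024 * 1024 * 1024;   // executor heap, e.g. --executor-memory 4g
        long reservedBytes = 300L * 1024 * 1024;    // Spark's fixed reserved memory (300 MB)
        double sparkMemoryFraction = 0.6;           // default spark.memory.fraction

        // Memory left for user data structures after Spark claims its unified pool.
        long userBytes = (long) ((heapBytes - reservedBytes) * (1.0 - sparkMemoryFraction));

        System.out.printf("Approx. user memory: %d MB%n", userBytes / (1024 * 1024));
        // For a 4 GB heap this prints roughly 1518 MB, i.e. the ~1500 MB quoted above.
    }
}
```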
Comments