Spark memory fraction vs Young Generation/Old Generation Java heap split


Question


I am studying Spark and I have some doubts regarding the Executor memory split. Specifically, in the Apache Spark documentation (here) it is stated that:

> Java Heap space is divided in to two regions Young and Old. The Young
> generation is meant to hold short-lived objects while the Old
> generation is intended for objects with longer lifetimes.

as shown in this image:

[Image: Java heap split into the Young and Old generations]

But for the Spark Executor there is another abstract split of the memory, as stated by the Apache Spark docs (here):

> Memory usage in Spark largely falls under one of two categories:
> execution and storage. Execution memory refers to that used for
> computation in shuffles, joins, sorts and aggregations, while storage
> memory refers to that used for caching and propagating internal data
> across the cluster. In Spark, execution and storage share a unified
> region (M).

As shown here:

[Image: unified region (M) shared by execution and storage memory, alongside user memory]

I don't understand how the Young Gen/Old Gen split overlaps with the storage/execution memory split, because the same doc (again here) states that:

> spark.memory.fraction expresses the size of M as a fraction of the
> (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%)
> is reserved for user data structures, internal metadata in Spark, and
> safeguarding against OOM errors in the case of sparse and unusually
> large records.

where spark.memory.fraction represents the execution/storage memory portion of the Java heap.
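To make the arithmetic concrete, here is a minimal sketch in plain Python, using the defaults quoted above (a fixed 300 MiB reservation and spark.memory.fraction = 0.6), of how the unified region M and the remaining user memory are derived from the executor heap:

```python
# Sketch of the split described in the Spark tuning guide:
# M = (heap - 300 MiB) * spark.memory.fraction; the rest is "user" memory.

RESERVED_MB = 300        # fixed reservation taken off the top of the heap
MEMORY_FRACTION = 0.6    # default value of spark.memory.fraction

def unified_region_mb(heap_mb, fraction=MEMORY_FRACTION):
    """Size of the unified execution + storage region M, in MiB."""
    return (heap_mb - RESERVED_MB) * fraction

def user_memory_mb(heap_mb, fraction=MEMORY_FRACTION):
    """Space left for user data structures and Spark's internal metadata."""
    return (heap_mb - RESERVED_MB) * (1 - fraction)

# Example: a 4 GiB executor heap
print(unified_region_mb(4096))  # ~2277.6 MiB -> execution + storage (M)
print(user_memory_mb(4096))     # ~1518.4 MiB -> user memory
```

Note that neither of these numbers says anything about which generation the objects end up in; that is decided later by the garbage collector.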

But
> If the OldGen is close to being full, reduce the amount of memory used
> for caching by lowering spark.memory.fraction; it is better to cache
> fewer objects than to slow down task execution.

This seems to suggest that the OldGen is in fact the User Memory, but the following statement seems to contradict that hypothesis:

> If the OldGen is close to being full, alternatively, consider decreasing the size of the Young generation.

What am I not seeing?

How is the Young Gen/Old Gen split related to the Spark memory fraction / User Memory split?

Answer 1

Score: 3


The short answer is that they're not really related beyond both having to do with the JVM heap.

The better way to think of this is that there are four buckets (numbered in no significant order):

  1. Spark memory in the young gen
  2. Spark memory in the old gen
  3. User memory in the young gen
  4. User memory in the old gen

(technically there's also some system memory that's neither Spark nor User, but this typically is small enough to not worry about: this can also be either old or young).

Whether an object is classed as Spark or User is decided by Spark (I actually don't know if this is an eternal designation or if objects can change their categorization in this respect).

As for old vs. young, this is managed by the garbage collector and the GC can and will promote objects from young to old. In some GC algorithms, the sizes of the generations are dynamically adjusted (or they use fixed size regions and a given region can be old or young).

You can control the aggregate capacities of 1+2, 3+4, 1+3, and 2+4, but you don't really have (and probably don't really want) control over the capacities of 1, 2, 3, or 4 individually, because there's a lot of benefit in being able to use excess space in one category to temporarily get more space in another.
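As a summary, those controllable aggregates map onto configuration roughly like this (a sketch with illustrative values, not recommendations; the individual buckets 1-4 are sized at runtime by Spark and the GC, not by configuration):

```python
# Which configuration knobs control which aggregates of the four buckets
# (1: Spark/young, 2: Spark/old, 3: User/young, 4: User/old).
# Values are illustrative only.
conf = {
    # 1+2 vs 3+4: Spark (execution + storage) memory vs user memory
    "spark.memory.fraction": "0.5",
    # 1+3 vs 2+4: young vs old generation, via JVM flags on the executors
    "spark.executor.extraJavaOptions": "-Xmn1g",
    # 1+2+3+4: the total executor heap
    "spark.executor.memory": "4g",
}

for key, value in sorted(conf.items()):
    print(f"{key} = {value}")
```

This is why the two tuning suggestions in the question are not contradictory: lowering spark.memory.fraction and shrinking the young generation act along two independent axes of the same heap.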

Posted by huangapple on 2020-08-25 00:45:40. Original link: https://go.coder-hub.com/63565290.html