Spark memory fraction vs Young Generation/Old Generation Java heap split

Question

I am studying Spark and have some questions about how executor memory is split. Specifically, the Apache Spark documentation (here) states that:

> Java Heap space is divided in to two regions Young and Old. The Young
> generation is meant to hold short-lived objects while the Old
> generation is intended for objects with longer lifetimes.

As shown here:

[Image: Java heap split into Young and Old generations]

But for the Spark executor there is another, more abstract split of memory, as stated in the Apache Spark documentation (here):

> Memory usage in Spark largely falls under one of two categories:
> execution and storage. Execution memory refers to that used for
> computation in shuffles, joins, sorts and aggregations, while storage
> memory refers to that used for caching and propagating internal data
> across the cluster. In Spark, execution and storage share a unified
> region (M).

As shown here:

[Image: diagram of the unified execution/storage memory region (M)]

I don't understand how Young Gen/Old Gen overlap with storage/execution memory, because the same document (again, here) states that:

> spark.memory.fraction expresses the size of M as a fraction of the
> (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%)
> is reserved for user data structures, internal metadata in Spark, and
> safeguarding against OOM errors in the case of sparse and unusually
> large records.

Here, spark.memory.fraction represents the execution/storage part of the Java heap.

However, the same document also says:
> If the OldGen is close to being full, reduce the amount of memory used
> for caching by lowering spark.memory.fraction; it is better to cache
> fewer objects than to slow down task execution.

This seems to suggest that the OldGen is in fact the User Memory, but the following statement seems to contradict my hypothesis:

> If the OldGen is close to being full, alternatively, consider decreasing the size of the Young generation.

What am I not seeing?

How is the Young Gen/Old Gen split related to the Spark memory fraction/User Memory split?

Answer 1

Score: 3

The short answer is that the two splits are not really related, beyond both having to do with the JVM heap.

A better way to think about this is that there are four buckets (numbered in no particular order):

  1. Spark memory in the young gen
  2. Spark memory in the old gen
  3. User memory in the young gen
  4. User memory in the old gen

(Technically, there is also some system memory that is neither Spark nor User memory, but it is typically small enough not to worry about; it too can be in either the old or the young generation.)

Whether an object is classed as Spark or User memory is decided by Spark (I actually don't know whether this is a permanent designation or whether an object's category can change).

As for old vs. young, this is managed by the garbage collector, and the GC can and will promote objects from young to old. In some GC algorithms, the sizes of the generations are adjusted dynamically (or the heap is divided into fixed-size regions, and a given region can be old or young).

You have control over the aggregate capacities of 1+2, 3+4, 1+3, and 2+4, but you don't really have control over the individual capacities of 1, 2, 3, or 4. You probably don't want that control either, because there is a lot of benefit in being able to use excess space in one category to temporarily get more space in another.
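
To make the aggregates concrete, here is a minimal sketch of the arithmetic in Python. It assumes a collector with a fixed old:young ratio set through `-XX:NewRatio`; collectors that resize generations dynamically will not follow it exactly. The function name `aggregate_capacities` and the 3300 MiB heap are hypothetical, chosen only for illustration. The 300 MiB reservation and the 0.6 default come from the documentation quoted in the question.

```python
RESERVED_MIB = 300  # subtracted from the heap before spark.memory.fraction applies


def aggregate_capacities(heap_mib, memory_fraction=0.6, new_ratio=2):
    """Return the four configurable aggregate capacities, in MiB.

    heap_mib        -- executor JVM heap size (hypothetical input)
    memory_fraction -- value of spark.memory.fraction
    new_ratio       -- value of -XX:NewRatio (old:young), assumed fixed
    """
    usable = heap_mib - RESERVED_MIB
    spark = usable * memory_fraction    # buckets 1 + 2
    user = usable - spark               # buckets 3 + 4
    # The generations cover the whole heap, including the reserved 300 MiB.
    young = heap_mib / (new_ratio + 1)  # buckets 1 + 3
    old = heap_mib - young              # buckets 2 + 4
    return {"spark": spark, "user": user, "young": young, "old": old}


if __name__ == "__main__":
    for fraction in (0.6, 0.5):
        print(fraction, aggregate_capacities(3300, memory_fraction=fraction))
```

Note that changing `memory_fraction` leaves `young` and `old` untouched, and changing `new_ratio` leaves `spark` and `user` untouched. Nothing in the calculation yields the size of an individual bucket such as "Spark memory in the old gen", which is the point made above.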

huangapple
  • Published on August 25, 2020, 00:45:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63565290.html