What is a "partition" in Spark?


Question


I'm trying to understand: what is a partition in Spark?

My understanding is: when we read from a source into a specific Dataset, that Dataset can be split into multiple sub-Datasets, and those sub-Datasets are called partitions. It is then up to the Spark framework to decide where and how they are distributed in the cluster. Is that correct?

I came into a doubt when I read some online articles, which say:

> Under the hood, these RDDs or Datasets are stored in partitions on
> different cluster nodes. Partition basically is a logical chunk of a
> large distributed data set

This statement breaks my understanding. According to it, RDDs or Datasets sit inside partitions, but I was thinking of the RDD itself as a partition (after the splitting).

Can anyone help me clear up this doubt?

Here is my code snippet, where I am reading from JSON:

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);

So, while reading itself, how can I split this into multiple partitions? Or is there some other way to do it?

Answer 1

Score: 5


What is a partition?

> As per the Spark documentation, a partition in Spark is an atomic chunk
> of data (a logical division of data) stored on a node in the cluster.
> Partitions are basic units of parallelism in Apache Spark. An
> RDD/DataFrame/Dataset in Apache Spark is a collection of partitions.

So, when you do

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);

Spark reads your source JSON data, creates logical divisions of that data (the partitions), and then processes those partitions in parallel across the cluster.
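You can see this for yourself. As a minimal sketch (reusing the question's own spark session, Jsonreadystructure.SCHEMA, and JsonPath), the partition count of a freshly read Dataset can be inspected through its underlying RDD:

// How many partitions did Spark create while reading the JSON source?
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);
int numPartitions = ds.rdd().getNumPartitions();
System.out.println("Partitions: " + numPartitions);

The number you get typically depends on the size and layout of the input files and on settings such as spark.sql.files.maxPartitionBytes.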

For example, in layman's terms...
If you have the task of moving a 1-ton load of wheat from one place to another, and you have only one human resource (similar to a single thread) to do it, several things can happen:

1) Your resource might not be able to move such a huge weight at once (similar to not having enough CPU or RAM).
2) If it is capable (similar to a high-spec machine), it might take a very long time, and it might get stressed out.
3) And your resource can't handle any other work while it is doing the transfer. And so on...

But if you divide the 1-ton load of wheat into 1-kg blocks (similar to logical partitions of the data), hire more men, and then ask your resources to move them, it becomes a lot easier for them. You can add a few more human resources (similar to scaling up the cluster) and finish the actual task easily and quickly.

Similar to the approach above, Spark performs a logical division of the data so that you can process the data in parallel, use your cluster resources optimally, and finish your task much faster.

Note: RDD, Dataset, and DataFrame are just abstractions over logical partitions of data. There are other concepts in RDD and DataFrame that I didn't cover in the example (i.e., resilience and immutability).

How can I split this into multiple partitions?

You can use the repartition API to split the data into more partitions:

spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath).repartition(number)

And you can use the coalesce() API to bring the number of partitions down.
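Putting the two together, here is a minimal sketch (the partition counts 8 and 2 are arbitrary values chosen for illustration, not recommendations):

// Read the JSON and explicitly split it into 8 partitions (full shuffle)
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath)
        .repartition(8);
System.out.println(ds.rdd().getNumPartitions());     // 8

// Later, reduce to 2 partitions; coalesce merges existing
// partitions instead of redistributing all the data
Dataset<Row> merged = ds.coalesce(2);
System.out.println(merged.rdd().getNumPartitions()); // 2

Note that repartition(n) can both increase and decrease the number of partitions but always shuffles the data, while coalesce(n) can only decrease it and skips the full shuffle, which makes it the cheaper option when reducing partitions.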

huangapple
  • Posted on 2020-09-16 12:50:41
  • When reposting, please keep the original link: https://go.coder-hub.com/63913326.html