Will Spark be able to perform "order by" on a DataFrame if the size of the DataFrame is larger than the executors' memory?

Question

I have a doubt regarding Spark's "order by" clause. I am not sure what will happen if we perform an "order by" on a DataFrame that is larger than the memory of a single executor and the driver. I searched online but didn't find a satisfactory answer, and I don't have the flexibility to test it myself. Can someone please help me with this?

I know that when we execute an "order by" operation on a DataFrame, Spark performs a global sort, which involves shuffling and redistributing data across multiple partitions. The data is divided into partitions, each partition is sorted individually, and Spark then merges the sorted partitions to produce the final sorted output.

For this to happen, Spark needs to collect all the partitions into one executor. Now, if an executor's memory is smaller than the total data, will Spark be able to order it? Will it spill to disk to perform the "order by" operation, or will it throw an OOM error?
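To make the scenario concrete, here is a minimal PySpark sketch of the operation in question (the input path and column name are hypothetical, not from the original question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orderby-demo").getOrCreate()

# Hypothetical input; in the scenario being asked about, this data
# is larger than any single executor's memory.
df = spark.read.parquet("/data/large_dataset")

# A global sort on one column.
sorted_df = df.orderBy("some_column")

# The physical plan shows an "Exchange rangepartitioning" step:
# rows are shuffled into key ranges, and each resulting partition
# is then sorted locally on its executor.
sorted_df.explain()
```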

Answer 1

Score: 0

Yes, Spark will still be able to sort a DataFrame even if its size exceeds a single executor's memory. Here is a post from Databricks describing a Spark benchmark in which a petabyte of data was sorted.

You have correctly described that Spark shuffles the data and sorts it locally. But it does not collect the sorted partitions onto one executor unless you tell it to, for example by calling collect() on the DataFrame. Within each executor, the sort can spill intermediate data to disk when a partition does not fit in execution memory, so the sort is not limited by RAM alone. The sorted data can be written to storage in the correct order by writing the individual sorted partitions in order; Spark knows this order because it is determined during the execution of the distributed sort.
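As a sketch of the pattern described above (the paths and column name are hypothetical), the sorted result can be written to storage directly instead of being collected to the driver:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-sort").getOrCreate()

df = spark.read.parquet("/data/large_dataset")  # hypothetical path

# Range-partitioning shuffle followed by a per-partition local sort.
sorted_df = df.orderBy("some_column")

# Each executor writes its own sorted partition; the output part
# files are numbered in global sort order, so no single executor
# ever has to hold the whole dataset.
sorted_df.write.mode("overwrite").parquet("/data/sorted_output")

# By contrast, collect() would pull every row to the driver and
# risks an OOM there for data of this size:
# rows = sorted_df.collect()
```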
