June 13, 2023, 12:41:56

Can Spark perform an "order by" on a DataFrame that is larger than the executor's memory?

Question

I have a question about Spark's "order by" clause. I am not sure what happens if we perform
an "order by" on a DataFrame that is larger than the memory of a single executor or the driver.
I searched online but did not find a satisfactory answer, and I do not have the means to test it myself.

My understanding so far is this: when we execute an "order by" operation on a DataFrame, Spark
performs a global sort, which involves shuffling and redistributing data across multiple partitions.
The data is divided into partitions, and each partition is sorted individually.
Then Spark merges the sorted partitions to produce the final sorted output.

For this to happen, I assume Spark needs to collect all the partitions into one executor.
If the executor's memory is smaller than the total data, will Spark still be able to sort it?
Will it spill to disk to perform the "order by", or will it throw an OOM error?

Can someone please help me with this?

Answer 1

Score: 0

Yes, Spark can still sort a DataFrame whose size exceeds the memory of a single executor. A post from Databricks describes a Spark benchmark in which a petabyte of data was sorted.

You have correctly described that Spark shuffles the data and sorts each partition locally. However, it does not collect the sorted partitions in one place unless you tell it to, for example by calling collect() on the DataFrame, which gathers all rows on the driver. The sorted data can be written to storage in the correct order simply by writing the individual sorted partitions in the correct order. Spark knows this order because it is determined while the distributed sort executes.
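To make the idea concrete, here is a minimal pure-Python sketch of a range-partitioned sort. It is a toy model under stated assumptions, not Spark's actual implementation: the function name `range_partition_sort` and its parameters are hypothetical, and plain lists stand in for partitions that would live on different executors.

```python
import bisect
import random


def range_partition_sort(rows, num_partitions, sample_size=100, seed=0):
    """Toy model of a distributed sort: sample, range-partition, sort locally."""
    if not rows:
        return [[] for _ in range(num_partitions)]
    rng = random.Random(seed)
    # 1. Sample the keys and pick split points that divide the key range.
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    bounds = [sample[(i * len(sample)) // num_partitions]
              for i in range(1, num_partitions)]
    # 2. "Shuffle": route each row to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[bisect.bisect_right(bounds, row)].append(row)
    # 3. Sort every partition on its own; none of them holds the full data set.
    for part in partitions:
        part.sort()
    return partitions


rng = random.Random(42)
data = [rng.randrange(1_000_000) for _ in range(10_000)]
parts = range_partition_sort(data, num_partitions=4)

# Emitting the partitions one after another gives a globally sorted result,
# without ever gathering and re-sorting everything in a single place.
merged = [x for part in parts for x in part]
assert merged == sorted(data)
```

Because every key in partition i is smaller than every key in partition i+1, the partition index alone fixes the output order. In Spark itself, the corresponding pattern would be to call `orderBy` and then write the DataFrame out, rather than calling `collect()`.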
