Is there a size limit in Databricks for converting an R dataframe to a Spark dataframe?

Question


I am new to Stack Overflow and have tried many ways to solve the error, but without success. My problem: I CAN convert subsets of an R dataframe to a Spark dataframe, but not the whole dataframe. Similar (but not identical) questions include:
https://stackoverflow.com/questions/41823116/not-able-to-to-convert-r-data-frame-to-spark-dataframe/41823982 and
https://stackoverflow.com/questions/38696047/is-there-any-size-limit-for-spark-dataframe-to-process-hold-columns-at-a-time

Here is some information about the R dataframe:

library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"

dim(df)
[1] 101368     25
class(df)
[1] "data.frame"

When converting this to a Spark DataFrame:

sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) : Error in handleErrors(returnStatus, conn) : 
Error in handleErrors(returnStatus, conn) : 

However, when I subset the R dataframe, it does NOT result in an error:

sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])

class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

How can I write the whole dataframe to a Spark DataFrame? (I want to saveAsTable afterwards).
I was thinking it might be a capacity problem, but I have no clue how to solve it.

Thanks a lot!!

Answer 1

Score: 2


In general you'll see poor performance when converting from R dataframes to Spark dataframes, and vice versa. Objects are represented differently in memory in Spark and R, and the object size expands significantly when converting from one to the other. This often blows out the memory of the driver, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a couple of options.

  1. Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert the representation from R to Spark. The link I provided has instructions on how to set this up on Databricks (see the sketch after this list).

  2. Write the dataframe to disk as parquet (or CSV) and then read it into Spark directly. You can use the arrow library in R to do this (also sketched after this list).

  3. Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type for your cluster (or ask your admin to do it) - make sure you pick one with a lot of memory. For reference, I tested collecting a 2GB dataset and needed a 30GB+ driver. With Arrow that comes down dramatically.
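A minimal sketch of options 1 and 2 in R might look like the following. It assumes a Databricks cluster where SparkR Arrow support is available (the spark.sql.execution.arrow.sparkr.enabled key exists from Spark 3.0 onward; on the 2.4.3 runtime in the question only option 2 applies), that the arrow R package is installed, and that the paths and table name are placeholders.

    library(SparkR)

    # Option 1: enable Arrow-based conversion between R and Spark.
    # This config key exists in Spark 3.0+; it has no effect for SparkR on 2.4.x.
    sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))
    sdf <- as.DataFrame(df)   # conversion now goes through Arrow

    # Option 2: write parquet on the driver with the arrow package, then let
    # Spark read the file directly, avoiding the in-driver conversion entirely.
    # "/dbfs/tmp/df.parquet" is a placeholder DBFS path.
    arrow::write_parquet(df, "/dbfs/tmp/df.parquet")
    sdf <- read.df("dbfs:/tmp/df.parquet", source = "parquet")

    # Then persist as a table ("my_table" is a placeholder name).
    saveAsTable(sdf, "my_table")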

Answer 2

Score: 1


Anecdotally, there is a limit on the size of table that SparkR will convert between data.table and DataFrame, and it is memory-dependent. It is also far smaller than I would have expected - around 50,000 rows in my work.

I had to convert some very large data.tables to DataFrames and ended up writing a script to chunk them into smaller pieces to get around this. Initially I chose to chunk the data by a fixed number of rows, but when a very wide table was converted the error returned. My work-around was to cap the number of elements (rows times columns) being converted at a time, as sketched below.
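A minimal sketch of that work-around in R, assuming SparkR is attached; chunked_as_DataFrame and max_elements are hypothetical names (not part of SparkR), and rbind on SparkDataFrames is used to union the converted pieces.

    library(SparkR)

    # Hypothetical helper: convert an R data.frame to a SparkDataFrame in
    # chunks, capping the number of elements (rows * columns) per call.
    chunked_as_DataFrame <- function(df, max_elements = 1e6) {
      rows_per_chunk <- max(1L, floor(max_elements / ncol(df)))
      starts <- seq(1L, nrow(df), by = rows_per_chunk)
      pieces <- lapply(starts, function(s) {
        e <- min(s + rows_per_chunk - 1L, nrow(df))
        as.DataFrame(df[s:e, , drop = FALSE])
      })
      Reduce(rbind, pieces)  # rbind on SparkDataFrames unions them row-wise
    }

    sdf <- chunked_as_DataFrame(df, max_elements = 5e5)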
