Is there a size limit in Databricks for converting an R dataframe to a Spark dataframe?

Question


I am new to Stack Overflow and have tried many ways to solve the error, but without success. My problem: I CAN convert subsets of an R dataframe to a Spark dataframe, but not the whole dataframe. Similar (but not identical) questions include:
https://stackoverflow.com/questions/41823116/not-able-to-to-convert-r-data-frame-to-spark-dataframe/41823982 and
https://stackoverflow.com/questions/38696047/is-there-any-size-limit-for-spark-dataframe-to-process-hold-columns-at-a-time

Here is some information about the R dataframe:

library(SparkR)
sparkR.session()
sparkR.version()
[1] "2.4.3"

dim(df)
[1] 101368     25
class(df)
[1] "data.frame"

When converting this to a Spark DataFrame:

sdf <- as.DataFrame(df)
Error in handleErrors(returnStatus, conn) : Error in handleErrors(returnStatus, conn) : 
Error in handleErrors(returnStatus, conn) : 

However, when I subset the R dataframe, it does NOT result in an error:

sdf_sub1 <- as.DataFrame(df[c(1:50000), ])
sdf_sub2 <- as.DataFrame(df[c(50001:101368), ])

class(sdf_sub1)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

class(sdf_sub2)
[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

How can I write the whole dataframe to a Spark DataFrame? (I want to saveAsTable afterwards).
I was thinking it might be a capacity problem, but I have no clue how to solve it.

Thanks a lot!!

Answer 1

Score: 2


In general you'll see poor performance when converting from R dataframes to Spark dataframes, and vice versa. Objects are represented differently in memory in Spark and R, and the object size expands significantly when converting from one to the other. This often blows out the memory of the driver, making it difficult to copy/collect large objects to/from Spark. Fortunately, you have a couple of options.

  1. Use Apache Arrow to establish a common in-memory format for objects, eliminating the need to copy and convert the representation from R to Spark. The link I provided has instructions on how to set this up on Databricks (see the sketch after this list).

  2. Write the dataframe to disk as parquet (or CSV) and then read it into Spark directly. You can use the arrow library in R to do this (also sketched after this list).

  3. Increase the size of your driver node to accommodate the memory expansion. On Databricks you can select the driver node type for your cluster (or ask your admin to do it) - make sure you pick one with a lot of memory. For reference, I tested collecting a 2GB dataset and needed a 30GB+ driver. With Arrow that comes down dramatically.
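A minimal sketch of options 1 and 2 in R might look like the following. It assumes a Databricks cluster where SparkR Arrow support is available (the spark.sql.execution.arrow.sparkr.enabled key exists from Spark 3.0 onward; on the 2.4.3 runtime in the question only option 2 applies), that the arrow R package is installed, and that the paths and table name are placeholders.

    library(SparkR)

    # Option 1: enable Arrow-based conversion between R and Spark.
    # This config key exists in Spark 3.0+; it has no effect for SparkR on 2.4.x.
    sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))
    sdf <- as.DataFrame(df)   # conversion now goes through Arrow

    # Option 2: write parquet on the driver with the arrow package, then let
    # Spark read the file directly, avoiding the in-driver conversion entirely.
    # "/dbfs/tmp/df.parquet" is a placeholder DBFS path.
    arrow::write_parquet(df, "/dbfs/tmp/df.parquet")
    sdf <- read.df("dbfs:/tmp/df.parquet", source = "parquet")

    # Then persist as a table ("my_table" is a placeholder name).
    saveAsTable(sdf, "my_table")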

Answer 2

Score: 1


Anecdotally, there is a limit on the size of table that SparkR will convert between data.table and DataFrame, and it is memory-dependent. It is also far smaller than I would have expected - around 50,000 rows in my work.

I had to convert some very large data.tables to DataFrames and ended up writing a script to chunk them into smaller pieces to get around this. Initially I chose to chunk the data by a fixed number of rows, but when a very wide table was converted the error returned. My work-around was to cap the number of elements (rows times columns) being converted at a time, as sketched below.
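A minimal sketch of that work-around in R, assuming SparkR is attached; chunked_as_DataFrame and max_elements are hypothetical names (not part of SparkR), and rbind on SparkDataFrames is used to union the converted pieces.

    library(SparkR)

    # Hypothetical helper: convert an R data.frame to a SparkDataFrame in
    # chunks, capping the number of elements (rows * columns) per call.
    chunked_as_DataFrame <- function(df, max_elements = 1e6) {
      rows_per_chunk <- max(1L, floor(max_elements / ncol(df)))
      starts <- seq(1L, nrow(df), by = rows_per_chunk)
      pieces <- lapply(starts, function(s) {
        e <- min(s + rows_per_chunk - 1L, nrow(df))
        as.DataFrame(df[s:e, , drop = FALSE])
      })
      Reduce(rbind, pieces)  # rbind on SparkDataFrames unions them row-wise
    }

    sdf <- chunked_as_DataFrame(df, max_elements = 5e5)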
