July 31, 2023 21:08:49

Output Parquet file is very big in size after repartitioning with column in Spark

Question

The DataFrame that I am repartitioning by a column is producing a single file of more than 500 MB.

df.repartition(col("column_name")).write.parquet("gs://path_of_bucket")

Is there a way to limit the size of each output Parquet file to 128 MB? I don't want to set a fixed number of partitions, because the output volume can vary from hour to hour. I am using a Dataproc cluster, and the output goes to a GCS bucket.

Answer 1

Score: 3

You can use spark.sql.files.maxRecordsPerFile to split the DataFrame being written into files of at most X rows each.

> Property Name: spark.sql.files.maxRecordsPerFile
> Default: 0
> Meaning: Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
> Since Version: 2.2.0
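
As a rough sketch, the setting could be applied to the write from the question through the per-write option maxRecordsPerFile on the DataFrame writer. The helper name write_capped_parquet and its parameters are hypothetical, chosen only for illustration; df is assumed to be an existing PySpark DataFrame.

```python
def write_capped_parquet(df, path, partition_column, max_records_per_file):
    """Repartition by a column, then write Parquet with a per-file row cap.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame and
    `path` an output location such as a GCS bucket path.
    """
    (
        df.repartition(partition_column)  # repartition accepts a column name
        .write
        # Per-write equivalent of spark.sql.files.maxRecordsPerFile:
        # a task starts a new file once this many rows have been written.
        .option("maxRecordsPerFile", max_records_per_file)
        .parquet(path)
    )
```

The same limit can instead be set for the whole session with spark.conf.set("spark.sql.files.maxRecordsPerFile", X), in which case every file-based write in that session is affected.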

If your rows are more or less uniform in length, you can estimate the number X that would give your desired size (128MB).
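
One way to make that estimate, sketched under the assumption that you already have one representative output file whose size and row count you know: scale the row count by the ratio of the target size to the observed size. The function name estimate_max_records and the numbers in the usage note are hypothetical, and the result is only approximate because Parquet compression and encoding vary with the data.

```python
def estimate_max_records(sample_file_bytes, sample_row_count,
                         target_file_bytes=128 * 1024 * 1024):
    """Estimate how many rows fit in a file of roughly target_file_bytes.

    Assumes rows are roughly uniform in size, so bytes per row in the
    sample file is representative of future output.
    """
    if sample_file_bytes <= 0 or sample_row_count <= 0:
        raise ValueError("sample size and row count must be positive")
    # Integer arithmetic avoids floating-point rounding on large values.
    rows = target_file_bytes * sample_row_count // sample_file_bytes
    return max(1, rows)
```

For example, if a 500 MiB file happened to hold 4,000,000 rows, estimate_max_records(500 * 1024 * 1024, 4_000_000) gives 1,024,000 rows per file for a 128 MiB target. Rounding that down further leaves some headroom for rows that are larger than average.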
