Output Parquet file is very large after repartitioning by column in Spark
Question
The dataframe I am trying to repartition based on a column is generating a single file of more than 500 MB.
df.repartition(col("column_name")).write.parquet("gs://path_of_bucket")
Is there a way to limit the size of the output Parquet files to 128 MB? I don't want to use a fixed number of partitions because the output volume can vary hourly. I am using a Dataproc cluster and the output goes into a GCS bucket.
Answer 1

Score: 3
You can use `spark.sql.files.maxRecordsPerFile` to split the dataframe being written into files of X rows each.
> Property Name: spark.sql.files.maxRecordsPerFile<br>
> Default: 0<br>
> Meaning: Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.<br>
> Since Version: 2.2.0
If your rows are more or less uniform in length, you can estimate the value of X that would give your desired file size (128 MB).
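For illustration, a minimal PySpark sketch of that estimate, assuming an average serialized row size of about 1 KB; the source path and the row-size figure are placeholders you would replace with your own numbers:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Placeholder estimate: target ~128 MB files, assume ~1 KB per serialized row.
target_file_bytes = 128 * 1024 * 1024
approx_row_bytes = 1024
max_records = target_file_bytes // approx_row_bytes  # ~131072 rows per file

# Cap the number of records written to any single output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", max_records)

df = spark.read.parquet("gs://path_of_source_bucket")  # placeholder input
df.repartition(col("column_name")).write.parquet("gs://path_of_bucket")
```

The same limit can also be applied per write with `.option("maxRecordsPerFile", max_records)` on the DataFrameWriter instead of setting the session-wide config.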