2023-08-10 15:51:11


How to get Parquet row group stats sorted across multiple files with PySpark?

Question


My process produces multiple Parquet files as output; in my case 64, which is the number of final partitions. Each file is around 400 MB, split into 3 row groups of about 128 MB each. Within a single file, every row group has its own min/max stats, and these are well distributed per column. For example, a long column has the following stats in one file (part-00000-*):

- group 1 - min:102 max: 9992
- group 2 - min:9994 max: 21990
- group 3 - min:22098 max: 35764

and the same column in the next file (part-00001-*) has a similar distribution:

- group 1 - min:99 max: 9882
- group 2 - min:9980 max: 21979
- group 3 - min:22018 max: 32764

for a total of 2 files.
What I need is a distribution like this:

part-00000-*:
- group 1 - min:99 max: 8662
- group 2 - min:8994 max: 13986
- group 3 - min:14333 max: 19845
part-00001-*:
- group 1 - min:19877 max: 25621
- group 2 - min:25654 max: 30091
- group 3 - min:31094 max: 35764

How can I get that when writing files in PySpark? I know I can simply sort with df.orderBy('longColumn'), but that hurts performance too much, so I am looking for an alternative.

Answer 1

Score: 1


Assuming a more or less uniform distribution of values in your longColumn, a repartitionByRange followed by sortWithinPartitions may be a better strategy.

# Parentheses let the method chain span several lines in Python.
(
    df.repartitionByRange(64, "longColumn")
    .sortWithinPartitions("longColumn", ascending=True)
    .write.parquet("myFile.parquet")
)

This way, each of the 64 partitions holds a disjoint range of longColumn and is sorted locally, so no single executor has to sort the whole DataFrame.
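To see why range partitioning plus a local sort yields non-overlapping row-group stats, here is a rough pure-Python simulation. It does not use Spark: the function names, the sampling scheme for the range boundaries, and the equal-count row groups are all simplifying assumptions (Parquet writers cut row groups by size, not by row count), so treat it as an illustration of the idea rather than of Spark's internals.

```python
import bisect
import random


def range_partition_and_sort(values, num_partitions):
    """Assign each value to one of num_partitions contiguous ranges, then sort each partition."""
    # Pick range boundaries from a sorted sample (fixed seed so the sketch is repeatable).
    sample = sorted(random.Random(0).sample(values, min(len(values), 100)))
    bounds = [sample[i * len(sample) // num_partitions] for i in range(1, num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        # bisect_left counts the boundaries strictly below v, which is the partition index.
        partitions[bisect.bisect_left(bounds, v)].append(v)
    return [sorted(p) for p in partitions]


def row_group_stats(partition, groups_per_file):
    """Cut a partition into equal-count chunks and return (min, max) per chunk."""
    size = max(1, -(-len(partition) // groups_per_file))  # ceiling division
    chunks = [partition[i:i + size] for i in range(0, len(partition), size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


# Made-up data spanning the value range mentioned in the question.
rng = random.Random(42)
values = [rng.randint(99, 35764) for _ in range(10_000)]

files = range_partition_and_sort(values, num_partitions=2)
stats = [row_group_stats(f, groups_per_file=3) for f in files]
for i, file_stats in enumerate(stats):
    print(f"part-{i:05d}: {file_stats}")
```

Each printed file covers its own value range, and the row groups inside it ascend, which is the layout asked for in the question.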

Answer 2

Score: 0


You can achieve this by making sure you have a global ordering on that column before you write your file. Let's say your dataframe is called df and your long column is called longColumn.

Then you do something like:

df.orderBy("longColumn").write.parquet("myFile.parquet")

If you then check the row-group min/max values for longColumn, you will see the distribution you described.
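One possible way to do that check is sketched below. It assumes pyarrow is installed (it is not mentioned in the thread), and both helper names are made up for illustration. The first helper reads the per-row-group statistics from a file footer; the second tests whether a sequence of (min, max) ranges is ascending and non-overlapping. The sample ranges are the ones from the question.

```python
def row_group_ranges(path, column):
    """Return (min, max) of `column` for every row group in one Parquet file."""
    # Imported here so the rest of the sketch works without pyarrow (an assumed dependency).
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    ranges = []
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        for c in range(group.num_columns):
            col = group.column(c)
            # Statistics can be absent if the writer did not record them.
            if col.path_in_schema == column and col.statistics is not None:
                ranges.append((col.statistics.min, col.statistics.max))
    return ranges


def is_globally_sorted(ranges):
    """True if every range is valid and each one ends before the next begins."""
    valid = all(lo <= hi for lo, hi in ranges)
    ordered = all(a[1] <= b[0] for a, b in zip(ranges, ranges[1:]))
    return valid and ordered


# Row-group ranges copied from the question, files concatenated in order.
current = [(102, 9992), (9994, 21990), (22098, 35764),
           (99, 9882), (9980, 21979), (22018, 32764)]
desired = [(99, 8662), (8994, 13986), (14333, 19845),
           (19877, 25621), (25654, 30091), (31094, 35764)]

print(is_globally_sorted(current))
print(is_globally_sorted(desired))
```

For real output, one would call row_group_ranges on each part file in name order, concatenate the results, and pass them to is_globally_sorted.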
