
How Repartitioning of a data frame on frequently used filter column can be helpful in Spark?

Question


I have been watching some materials here and there about repartitioning and coalescing of Spark data frames. Some said repartitioning can improve performance if it is done on a frequently filtered column, but I don't understand why. I know my question isn't specific, since the video I watched didn't elaborate and I couldn't get any responses.

Is that because filtering will result in fewer partitions?

Any insight will be welcomed.

Answer 1

Score: 1


No matter what transformation you are doing, if you do that transformation on a partitioned column, it will be faster, because the layout tells the engine where each piece of data lives. Therefore, the filtering will be faster: it does not have to scan all your data, and can directly "select" only the rows you are interested in.
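The pruning effect described above can be sketched with a toy simulation in plain Python (Spark is not required here; the bucket dict merely stands in for a layout partitioned on the filter column, and the data is illustrative, not from the question):

```python
from collections import defaultdict

# Toy rows: (id, category). Illustrative data only.
rows = [(i + 1, c) for i, c in enumerate("AABBCCDDEE")]

# Without partitioning: a filter must inspect every single row.
full_scan_checked = 0
full_scan_hits = []
for row in rows:
    full_scan_checked += 1
    if row[1] == "C":
        full_scan_hits.append(row)

# With the data bucketed by the filter column (a stand-in for a
# partitioned layout), the filter reads only the matching bucket.
buckets = defaultdict(list)
for row in rows:
    buckets[row[1]].append(row)

pruned_hits = buckets["C"]

print(full_scan_checked)   # 10 rows inspected by the full scan
print(len(pruned_hits))    # 2 rows read from the pruned bucket
```

Both approaches return the same two rows; the difference is that the bucketed layout lets the filter skip everything outside the `"C"` bucket, which is the intuition behind partition pruning.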

Answer 2

Score: 0


I wouldn't recommend it. For instance:

Assume that, on an initial dataset, you repartition on a column and then filter on the same column. This creates a situation where many partitions have 0 records after the filter. As your pipeline moves ahead, the tasks for these partitions will end within 0-0.2 secs, while a small number of tasks (mapping to the partitions with data) will actually work on the data, making the whole pipeline slower.

Rather, I would like the filtered data to be present in all the partitions so that I utilise all the executor cores working on the data.

E.g.:

Dataset

id | category | partition
01 | A        | 1
02 | A        | 1
03 | B        | 1
04 | B        | 1
05 | C        | 2
06 | C        | 2
07 | D        | 2
08 | D        | 2
09 | E        | 2
10 | E        | 2

Assume that you now repartition on category

id | category | partition
01 | A        | 1
02 | A        | 1
03 | B        | 2
04 | B        | 2
05 | C        | 3
06 | C        | 3
07 | D        | 4
08 | D        | 4
09 | E        | 5
10 | E        | 5

Now, when you filter for a single category (say, category = 'C'), you have 0 records in most partitions. Tasks further down your pipeline on these partitions essentially do no work, and those executor cores are wasted.
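The empty-partition effect above can be simulated in plain Python (a sketch only: a modular hash stands in for Spark's hash partitioner, and the partition ids are illustrative, not what Spark would actually assign):

```python
from collections import defaultdict

# Same toy dataset as the tables above: ids 01-10 across categories A-E.
rows = [(i + 1, c) for i, c in enumerate("AABBCCDDEE")]
num_partitions = 5

# "Repartition" on category: all rows of a category land in one
# partition, mimicking hash partitioning on that column.
partitions = defaultdict(list)
for row in rows:
    partitions[hash(row[1]) % num_partitions].append(row)

# Filter down to a single category. Every surviving row lives in the
# one partition that category hashed to; all other partitions go empty.
filtered = {p: [r for r in rs if r[1] == "C"] for p, rs in partitions.items()}
busy = sum(1 for rs in filtered.values() if rs)

print(busy)  # 1 -- only one partition still has work; the other cores idle
```

However the categories hash, the filtered rows can never occupy more than one partition, so at most one task downstream has real work to do.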

Hope this makes sense.

huangapple
  • Published on 2023-02-16 15:44:59
  • Please keep this link when reposting: https://go.coder-hub.com/75469172.html