How can repartitioning a data frame on a frequently used filter column help in Spark?

Question

I have been watching some material here and there about repartitioning and coalescing Spark data frames. Some of it said that repartitioning can improve performance if it is done on a column that is used for filtering. I don't understand why that is so. I know my question isn't specific, since the video I watched didn't elaborate and I couldn't get any responses.

Is that because filtering will result in fewer partitions?

Any insight will be welcomed.

Answer 1

Score: 1

No matter what transformation you are doing, if you do it on a partitioned column it will be faster, because partitioning lets the machine know where each piece of data is. Filtering is therefore faster because it does not have to scan all your data: it can directly "delete" or "select" only the rows you are interested in.
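
The idea of "knowing where each piece of data is" can be illustrated with a small plain-Python sketch. This is a toy model and not Spark code: the data, the function names (`scan_all`, `build_layout`, `scan_pruned`) and the dictionary layout are all hypothetical stand-ins. It only shows why a filter that can consult a layout keyed by the filter column examines fewer rows than a filter that has to scan everything.

```python
from collections import defaultdict

# Hypothetical toy data: 100 (id, category) rows over categories A-E.
rows = [(i, "ABCDE"[i % 5]) for i in range(1, 101)]


def scan_all(rows, wanted):
    """Filter with no knowledge of the layout: every row is examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[1] == wanted:
            matches.append(row)
    return matches, examined


def build_layout(rows):
    """Group rows by category, one bucket per distinct value."""
    layout = defaultdict(list)
    for row in rows:
        layout[row[1]].append(row)
    return layout


def scan_pruned(layout, wanted):
    """Filter using the layout: only the matching bucket is read."""
    bucket = layout.get(wanted, [])
    return list(bucket), len(bucket)


full_matches, full_examined = scan_all(rows, "B")
layout = build_layout(rows)
pruned_matches, pruned_examined = scan_pruned(layout, "B")

# Same result either way, but far fewer rows examined with the layout.
print(full_examined, pruned_examined)  # prints: 100 20
```

In Spark, skipping data like this is what partition pruning does for data written to disk with `DataFrameWriter.partitionBy`. An in-memory `repartition` on its own does not necessarily let Spark skip any rows, so the benefit described above depends on how the data is stored and read.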

Answer 2

Score: 0

I wouldn't recommend it. For instance:

Assume that, on an initial dataset, you repartition on a column and then filter on the same column. Many partitions will then have 0 records after the filter. As your pipeline moves ahead, the tasks for these partitions will finish within 0-0.2 seconds, while a small number of tasks (those mapped to the partitions that still hold data) do all the actual work, which makes the whole pipeline slower.

Rather, I would like the filtered data to be present in all the partitions so that I utilise all the executor cores working on the data.

For example:

Dataset

id|category|partition
01|A |1
02|A |1
03|B |1
04|B |1
05|C |2
06|C |2
07|D |2
08|D |2
09|E |2
10|E |2

Assume that you now repartition on category:

id|category|partition
01|A |1
02|A |1
03|B |2
04|B |2
05|C |3
06|C |3
07|D |4
08|D |4
09|E |5
10|E |5

Now, when you filter on a single category (for example, category = B), most partitions are left with 0 records. Tasks further along in your pipeline that run on these partitions essentially do no work, and those executor cores are wasted.

Hope this makes sense.
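
The effect described in this answer can be reproduced with a small plain-Python simulation. It is a sketch under stated assumptions, not Spark code: partitions are modelled as lists, the one-category-per-partition mapping copies the idealised table above (a real hash partitioner may place several categories in one partition), and the helper names are hypothetical.

```python
# Toy model: 100 rows over categories A-E and 5 partitions (indexed 0-4).
CATEGORIES = "ABCDE"
NUM_PARTITIONS = 5
rows = [(i, CATEGORIES[i % 5]) for i in range(100)]


def partition_by_category(rows):
    """Assumption: each category gets its own partition, as in the table."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[CATEGORIES.index(row[1])].append(row)
    return parts


def round_robin(rows):
    """Spread rows evenly over the partitions, ignoring their category."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for n, row in enumerate(rows):
        parts[n % NUM_PARTITIONS].append(row)
    return parts


def filter_partitions(parts, wanted):
    """Apply the filter inside each partition; partition count is unchanged."""
    return [[r for r in p if r[1] == wanted] for p in parts]


def busy(parts):
    """Number of partitions whose task would have rows to process."""
    return sum(1 for p in parts if p)


# Repartition on category, then filter on the same column.
by_category = filter_partitions(partition_by_category(rows), "B")

# Alternative: redistribute the filtered rows evenly afterwards.
spread = round_robin([r for p in by_category for r in p])

print([len(p) for p in by_category])  # prints: [0, 20, 0, 0, 0]
print([len(p) for p in spread])       # prints: [4, 4, 4, 4, 4]
```

In the first layout only one of the five downstream tasks has any rows; in the second all five do. In Spark, an even redistribution of this kind roughly corresponds to calling `repartition(n)` after the filter, which costs a shuffle, so whether it pays off depends on how much work follows.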

huangapple
  • Published on February 16, 2023, 15:44:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75469172.html