Handling of ties in row_number in PySpark vs SQL

Question

I have a table containing the following columns:
year, subject, marks, student name

Let's say it contains the following values:

year  subject  marks  student name
2023  Maths    91     Jon
2023  Maths    71     Dany
2023  Maths    71     Rob
2023  Maths    85     Joffery

Let's say I perform:

.withColumn('Ranking', f.row_number().over(
        Window.partitionBy('year', 'subject').orderBy(f.col('marks').asc())))

As Dany and Rob have the same marks, who will get 'Ranking' 1?

Will it yield a different result if I run it multiple times?

How are such "ties" handled in PySpark vs Redshift SQL?


Answer 1

Score: 1


In Apache Spark, when the order by values of several rows are identical, the relative order of those rows is not guaranteed. Spark does not provide a deterministic order for ties, so row_number may assign the tied numbers differently between runs.

The reason for this behavior is that Spark's processing is distributed across multiple nodes, and the order of data processing and aggregation is not guaranteed to be consistent across different partitions or nodes. As a result, when multiple rows have the same order by value, their relative order might vary between different executions or runs of the same query.

If you need a consistent order for rows with the same order by value, you should include additional columns in the order by clause to break the tie. By specifying additional columns, you can ensure a deterministic order. For example:


.withColumn('Ranking', f.row_number().over(
        Window.partitionBy('year', 'subject').orderBy(f.col('marks'), f.col('name'))))

By including one or more additional columns in the order by clause, you can achieve a well-defined order for rows that share the same order by value.
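The effect of the tie-breaking column can be sketched outside Spark with plain Python. This is a simplified, single-machine analogue of row_number() over an ordered partition (the real Spark behavior is distributed, which is exactly why ties are non-deterministic there); the data matches the table in the question, and the helper name row_number here is just for illustration:

```python
# Rows from the question's example table (one partition: 2023 / Maths).
rows = [
    {"year": 2023, "subject": "Maths", "marks": 91, "name": "Jon"},
    {"year": 2023, "subject": "Maths", "marks": 71, "name": "Dany"},
    {"year": 2023, "subject": "Maths", "marks": 71, "name": "Rob"},
    {"year": 2023, "subject": "Maths", "marks": 85, "name": "Joffery"},
]

def row_number(rows, key):
    """Assign 1-based rankings after sorting by `key` (ascending)."""
    ordered = sorted(rows, key=key)
    return [{**r, "Ranking": i} for i, r in enumerate(ordered, start=1)]

# Ambiguous: Dany and Rob tie on marks, so which of them gets
# Ranking 1 depends on the incoming row order, not on the data.
ambiguous = row_number(rows, key=lambda r: r["marks"])

# Deterministic: sorting by (marks, name) breaks the tie the same
# way on every run -- Dany always precedes Rob.
deterministic = row_number(rows, key=lambda r: (r["marks"], r["name"]))
for r in deterministic:
    print(r["Ranking"], r["name"], r["marks"])
```

With the (marks, name) key, Dany gets Ranking 1 and Rob gets Ranking 2 on every run; with marks alone, either outcome is possible, which mirrors the Spark behavior described above.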

huangapple
  • Published on June 8, 2023 at 14:45:38
  • Please keep this link when reposting: https://go.coder-hub.com/76429243.html