Handling of ties in row_number in PySpark vs SQL

Question
I have a table containing the following columns:

year, subject, marks, student name

Let's say it contains the following values:

year  subject  marks  student name
2023  Maths    91     Jon
2023  Maths    71     Dany
2023  Maths    71     Rob
2023  Maths    85     Joffery

Let's say I perform:

.withColumn('Ranking', f.row_number().over(
    Window.partitionBy('year', 'subject').orderBy(f.col('marks').asc())))

As Dany and Rob have the same marks, who will get 'Ranking' 1?
Will it yield a different result if I run it multiple times?
How are such "ties" handled in PySpark vs Redshift SQL?
Answer 1

Score: 1
In Apache Spark, the relative order of rows with identical order-by values is not guaranteed: row_number() does not assign ranks deterministically among ties.
The reason for this behavior is that Spark's processing is distributed across multiple nodes, and the order of data processing and aggregation is not guaranteed to be consistent across different partitions or nodes. As a result, when multiple rows have the same order by value, their relative order might vary between different executions or runs of the same query.
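The effect can be illustrated outside Spark with a plain-Python sketch (the data and helper below are hypothetical, not part of any Spark API): a sort keyed only on marks leaves tied rows in whatever order they arrived, so different arrival orders, analogous to different shuffle/partition orders across Spark runs, give a different rank 1.

```python
# Plain-Python illustration (NOT Spark): ranking only by 'marks'
# leaves the order of tied rows dependent on input order, which in
# Spark can vary between runs due to partitioning.
rows = [("Jon", 91), ("Dany", 71), ("Rob", 71), ("Joffery", 85)]

def rank_by_marks(data):
    # sorted() is stable, so tied rows keep their incoming order --
    # standing in for whatever order Spark's shuffle happens to produce
    ordered = sorted(data, key=lambda r: r[1])
    return [(name, marks, i + 1) for i, (name, marks) in enumerate(ordered)]

print(rank_by_marks(rows))        # Dany arrives before Rob -> Dany gets rank 1
print(rank_by_marks(rows[::-1]))  # reversed input -> Rob gets rank 1
```

Both calls rank the same marks, yet rank 1 goes to a different student depending only on input order, which is exactly the nondeterminism described above.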
If you need a consistent order for rows with the same order by value, you should include additional columns in the order by clause to break the tie. By specifying additional columns, you can ensure a deterministic order. For example:
.withColumn('Ranking', f.row_number().over(
    Window.partitionBy('year', 'subject').orderBy(f.col('marks'), f.col('name'))))
By including an additional column (or columns) in the order by clause, you get a well-defined order for rows that share the same primary order-by value. The same advice applies in Redshift SQL: ROW_NUMBER() over a non-unique ORDER BY is likewise nondeterministic for tied rows, so add tie-breaking columns there as well.
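The same plain-Python sketch (hypothetical data, not a Spark API) shows why the composite key fixes the problem: ordering by (marks, name) makes the ranking independent of input order, mirroring orderBy(f.col('marks'), f.col('name')) in the snippet above.

```python
# Plain-Python sketch of the tie-break fix: a composite sort key
# (marks, then name) yields the same ranking for any input order.
rows = [("Jon", 91), ("Dany", 71), ("Rob", 71), ("Joffery", 85)]

def rank_with_tiebreak(data):
    # name breaks ties among equal marks, making the order total
    ordered = sorted(data, key=lambda r: (r[1], r[0]))
    return [(name, marks, i + 1) for i, (name, marks) in enumerate(ordered)]

# Same ranking regardless of input order:
assert rank_with_tiebreak(rows) == rank_with_tiebreak(rows[::-1])
print(rank_with_tiebreak(rows))  # Dany (71) always ranks before Rob (71) by name
```

This only works if the tie-break columns make each ordering key unique within a partition; if two students could share both marks and name, a further column (e.g. a unique id) would be needed.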
Comments