Handling of ties in row_number in PySpark vs SQL

Question

I have a table containing the following columns:
year, subject, marks, student name

Let's say it contains the following values:

year  subject  marks  student name
2023  Maths    91     Jon
2023  Maths    71     Dany
2023  Maths    71     Rob
2023  Maths    85     Joffery

Let's say I perform:

.withColumn('Ranking', f.row_number().over(
        Window.partitionBy('year', 'subject').orderBy(f.col('marks').asc())))

As Dany and Rob have the same marks, who will get 'Ranking' 1?

Will it yield a different result if I run it multiple times?

How are such "ties" handled in PySpark vs Redshift SQL?


Answer 1

Score: 1


In Apache Spark, when the order by values of several rows are identical, the relative order of those rows is not guaranteed. Spark does not provide a deterministic order for ties, so row_number may assign the tied numbers differently between runs.

The reason for this behavior is that Spark's processing is distributed across multiple nodes, and the order of data processing and aggregation is not guaranteed to be consistent across different partitions or nodes. As a result, when multiple rows have the same order by value, their relative order might vary between different executions or runs of the same query.

If you need a consistent order for rows with the same order by value, you should include additional columns in the order by clause to break the tie. By specifying additional columns, you can ensure a deterministic order. For example:


.withColumn('Ranking', f.row_number().over(
        Window.partitionBy('year', 'subject').orderBy(f.col('marks'), f.col('name'))))

By including one or more additional columns in the order by clause, you can achieve a well-defined order for rows that share the same order by value.
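The effect of the tie-breaking column can be sketched outside Spark with plain Python. This is a simplified, single-machine analogue of row_number() over an ordered partition (the real Spark behavior is distributed, which is exactly why ties are non-deterministic there); the data matches the table in the question, and the helper name row_number here is just for illustration:

```python
# Rows from the question's example table (one partition: 2023 / Maths).
rows = [
    {"year": 2023, "subject": "Maths", "marks": 91, "name": "Jon"},
    {"year": 2023, "subject": "Maths", "marks": 71, "name": "Dany"},
    {"year": 2023, "subject": "Maths", "marks": 71, "name": "Rob"},
    {"year": 2023, "subject": "Maths", "marks": 85, "name": "Joffery"},
]

def row_number(rows, key):
    """Assign 1-based rankings after sorting by `key` (ascending)."""
    ordered = sorted(rows, key=key)
    return [{**r, "Ranking": i} for i, r in enumerate(ordered, start=1)]

# Ambiguous: Dany and Rob tie on marks, so which of them gets
# Ranking 1 depends on the incoming row order, not on the data.
ambiguous = row_number(rows, key=lambda r: r["marks"])

# Deterministic: sorting by (marks, name) breaks the tie the same
# way on every run -- Dany always precedes Rob.
deterministic = row_number(rows, key=lambda r: (r["marks"], r["name"]))
for r in deterministic:
    print(r["Ranking"], r["name"], r["marks"])
```

With the (marks, name) key, Dany gets Ranking 1 and Rob gets Ranking 2 on every run; with marks alone, either outcome is possible, which mirrors the Spark behavior described above.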

huangapple
  • Published on June 8, 2023 at 14:45:38
  • Please keep this link when reposting: https://go.coder-hub.com/76429243.html