Question

I have a PySpark DataFrame with the following format:

ID    Name  Score  Rank
1     A     10     1
1     B     20     2
2     C     10     1
2     C     12     2
3     D     11     1
4     E     12     1
4     E     13     2

The goal is to keep only the IDs whose Name differs between Rank 1 and Rank 2. If the names are the same, or if the ID has only one row, all rows for that ID should be filtered out. I think I could do this by building a second DataFrame that groups by ID and counts names, left-joining it back to the original, and dropping the IDs with a count below 2, but that feels too hacky. Is there a better way to do this with window functions? After filtering out the undesired rows, the data above should look like this:

ID    Name  Score  Rank
1     A     10     1
1     B     20     2

Answer 1

Score: 1

As I understand it, each ID has at most two rows, so we can apply min and max to the string column over a window partitioned by ID and compare the results. If every Name in the partition is identical, its min and max are equal; if they differ, the ID has two distinct names.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# One window per ID, no ordering needed
w = Window.partitionBy('ID')

df = (df
  # smallest and largest Name within each ID
  .withColumn('Name1', F.min('Name').over(w))
  .withColumn('Name2', F.max('Name').over(w))
  # number of rows within each ID
  .withColumn('n_rows', F.count('*').over(w))
  # keep IDs whose names differ and that have more than one row
  .filter( (F.col('Name1') != F.col('Name2')) & (F.col('n_rows') > 1) )
  .select('ID', 'Name', 'Score', 'Rank')
)

df.show()
+---+----+-----+----+
| ID|Name|Score|Rank|
+---+----+-----+----+
|  1|   A|   10|   1|
|  1|   B|   20|   2|
+---+----+-----+----+
