Pyspark: check if the consecutive values of a column are the same

Question

I have a pyspark dataframe with the following format:

ID    Name  Score  Rank  
1     A     10     1
1     B     20     2
2     C     10     1
2     C     12     2
3     D     11     1
4     E     12     1
4     E     13     2
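
For reference, a minimal sketch that rebuilds this sample dataframe (the variable name df matches the snippets below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reproducing the table above
df = spark.createDataFrame(
    [(1, 'A', 10, 1), (1, 'B', 20, 2),
     (2, 'C', 10, 1), (2, 'C', 12, 2),
     (3, 'D', 11, 1),
     (4, 'E', 12, 1), (4, 'E', 13, 2)],
    ['ID', 'Name', 'Score', 'Rank'])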

The goal is to find those IDs where the name is not the same for ranks 1 and 2, and to filter out the IDs where the name is the same or only one row is available. I think I can achieve that by creating another dataframe in which I group by ID and count the names, left-joining it with this dataframe, and filtering the rows with count < 2, but that feels too hacky; I was wondering whether there is a better way of doing this with window functions (a sketch of the join-based approach follows the expected output below). After filtering out the undesired rows, the above would look like this:

ID    Name  Score  Rank  
1     A     10     1
1     B     20     2
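
A minimal sketch of the join-based approach described above, using countDistinct so that IDs with a repeated name (like ID 2) are also dropped; the column name n_names is illustrative:

import pyspark.sql.functions as F

# Distinct names per ID; IDs whose ranks all carry the same name get n_names = 1
counts = df.groupBy('ID').agg(F.countDistinct('Name').alias('n_names'))

result = (df
  .join(counts, on='ID', how='left')
  .filter(F.col('n_names') > 1)   # keep only IDs with two different names
  .drop('n_names'))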

Answer 1

Score: 1

Since I understood that each ID has at most 2 rows, we can use a trick with the min and max functions on a string column and compare them; the trick being that the min and max of the same string value are equal.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Window over all rows sharing the same ID
w = Window.partitionBy('ID')

df = (df
  .withColumn('Name1', F.min('Name').over(w))   # smallest name within the ID
  .withColumn('Name2', F.max('Name').over(w))   # largest name within the ID
  .withColumn('n_rows', F.count('*').over(w))   # number of rows within the ID
  # keep IDs that have more than one row and two different names
  .filter((F.col('Name1') != F.col('Name2')) & (F.col('n_rows') > 1))
  .select('ID', 'Name', 'Score', 'Rank')
)

df.show()
+---+----+-----+----+
| ID|Name|Score|Rank|
+---+----+-----+----+
|  1|   A|   10|   1|
|  1|   B|   20|   2|
+---+----+-----+----+
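
Not part of the original answer: if an ID could have more than two rows, the same window w can count distinct names directly with collect_set, instead of relying on the min/max comparison. This sketch assumes the original (unfiltered) dataframe and reuses F and w from the snippet above; the column name n_names is illustrative:

# size(collect_set(...)) gives the number of distinct names within each ID,
# which also works when an ID has more than two rows
result = (df
  .withColumn('n_names', F.size(F.collect_set('Name').over(w)))
  .filter(F.col('n_names') > 1)
  .drop('n_names'))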
