Pyspark: check if the consecutive values of a column are the same
Question
I have a pyspark dataframe with the following format:
ID  Name  Score  Rank
1   A     10     1
1   B     20     2
2   C     10     1
2   C     12     2
3   D     11     1
4   E     12     1
4   E     13     2
The goal is to find the IDs where Name differs between ranks 1 and 2, and to filter out the IDs where the name is the same or only one name is available. I think I can achieve that by creating another dataframe where I groupBy ID and count the names, left joining it with this dataframe, and filtering out the rows with count < 2, but that's too hacky. I was wondering if there is a better way of doing this using window functions. After filtering out the undesired rows, the above would look like this:
ID  Name  Score  Rank
1   A     10     1
1   B     20     2
Answer 1
Score: 1
As I understand it, each ID has at most 2 rows, so we can use a trick: apply the min and max functions to the string column within each ID and compare the results. If every name in a group is the same, its min and max are equal.
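import pyspark.sql.functions as F
from pyspark.sql.window import Window

# One window per ID, so each aggregate below is computed within an ID group
w = Window.partitionBy('ID')

df = (
    df
    # Smallest and largest Name in the group; equal only if all names match
    .withColumn('Name1', F.min('Name').over(w))
    .withColumn('Name2', F.max('Name').over(w))
    # Number of rows per ID
    .withColumn('n_rows', F.count('*').over(w))
    # Keep IDs that have more than one row and differing names
    .filter((F.col('Name1') != F.col('Name2')) & (F.col('n_rows') > 1))
    .select('ID', 'Name', 'Score', 'Rank')
)
df.show()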
+---+----+-----+----+
| ID|Name|Score|Rank|
+---+----+-----+----+
|  1|   A|   10|   1|
|  1|   B|   20|   2|
+---+----+-----+----+
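To see why comparing min and max works, here is a minimal plain-Python sketch of the same idea on the sample data from the question. It does not use Spark at all; the helper name ids_with_differing_names is made up for illustration, and the rows are hard-coded as tuples purely as an assumption about how the data might be represented outside a dataframe.

```python
from collections import defaultdict

# Sample rows from the question: (ID, Name, Score, Rank)
rows = [
    (1, "A", 10, 1),
    (1, "B", 20, 2),
    (2, "C", 10, 1),
    (2, "C", 12, 2),
    (3, "D", 11, 1),
    (4, "E", 12, 1),
    (4, "E", 13, 2),
]


def ids_with_differing_names(rows):
    """Return the set of IDs whose group contains more than one distinct Name.

    Hypothetical helper, not part of any library.
    """
    names_by_id = defaultdict(list)
    for id_, name, _score, _rank in rows:
        names_by_id[id_].append(name)
    # min == max exactly when every name in the group is identical,
    # which also covers groups that have only a single row.
    return {id_ for id_, names in names_by_id.items() if min(names) != max(names)}


keep = ids_with_differing_names(rows)
filtered = [row for row in rows if row[0] in keep]
print(filtered)  # [(1, 'A', 10, 1), (1, 'B', 20, 2)]
```

Note that a group with a single row always has min equal to max, so the min/max comparison on its own already excludes IDs 3 here; the n_rows > 1 condition in the Spark version acts as an explicit safeguard rather than a strict requirement. The comparison also does not depend on there being at most 2 rows per ID: it flags any group with more than one distinct name.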