将多个列的所有不同值连接成一个列在 Pyspark 中。

huangapple go评论64阅读模式
英文:

concat all distinct value of several columns into a column in Pyspark

问题

我想要获取这样的结果:

C1 C2 C3 C4
a e i a b c d e f g h i j k l
b f j a b c d e f g h i j k l
c g k a b c d e f g h i j k l
d h l a b c d e f g h i j k l

你知道如何在Python的Pyspark中实现这个吗?谢谢!

英文:

I have a dataset like this :

C1 C2 C3
a e i
b f j
c g k
d h l

I want to obtain this :

C1 C2 C3 C4
a e i a b c d e f g h i j k l
b f j a b c d e f g h i j k l
c g k a b c d e f g h i j k l
d h l a b c d e f g h i j k l

Do you know how to do this in python Pyspark ? Thanks !

答案1

得分: 3

尝试使用window函数,然后使用array_sort,flatten函数以及collect_set函数。

示例:

from pyspark.sql.functions import *
from pyspark.sql import *
w=Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df2= df.withColumn("temp_c1",collect_set(col("c1")).over(w))\
  .withColumn("temp_c2",collect_set(col("c2")).over(w))\
  .withColumn("temp_c3",collect_set(col("c3")).over(w))\
  .withColumn("c4",array_join(array_distinct(sort_array(flatten(array(col("temp_c1"),col("temp_c2"),col("temp_c3"))))),'')).\
    drop(*['temp_c1','temp_c2','temp_c3'])

df2.show(10,False)
#+---+---+---+------------+
#|c1 |c2 |c3 |c4          |
#+---+---+---+------------+
#|b  |f  |j  |abcdefghijkl|
#|c  |g  |k  |abcdefghijkl|
#|d  |h  |l  |abcdefghijkl|
#|a  |e  |i  |abcdefghijkl|
#+---+---+---+------------+
英文:

Try with window function and then use array_sort,flatten functions with collect_set.

Example:

from pyspark.sql.functions import *
from pyspark.sql import *
w=Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df2= df.withColumn("temp_c1",collect_set(col("c1")).over(w))\
  .withColumn("temp_c2",collect_set(col("c2")).over(w))\
  .withColumn("temp_c3",collect_set(col("c3")).over(w))\
  .withColumn("c4",array_join(array_distinct(sort_array(flatten(array(col("temp_c1"),col("temp_c2"),col("temp_c3"))))),'')).\
    drop(*['temp_c1','temp_c2','temp_c3'])

df2.show(10,False)
#+---+---+---+------------+
#|c1 |c2 |c3 |c4          |
#+---+---+---+------------+
#|b  |f  |j  |abcdefghijkl|
#|c  |g  |k  |abcdefghijkl|
#|d  |h  |l  |abcdefghijkl|
#|a  |e  |i  |abcdefghijkl|
#+---+---+---+------------+

huangapple
  • 本文由 发表于 2023年4月19日 22:32:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76055730.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定