英文:
concat all distinct value of several columns into a column in Pyspark
问题
我想要获取这样的结果:
| C1 | C2 | C3 | C4 | 
|---|---|---|---|
| a | e | i | a b c d e f g h i j k l | 
| b | f | j | a b c d e f g h i j k l | 
| c | g | k | a b c d e f g h i j k l | 
| d | h | l | a b c d e f g h i j k l | 
你知道如何在Python的Pyspark中实现这个吗?谢谢!
英文:
I have a dataset like this :
| C1 | C2 | C3 | 
|---|---|---|
| a | e | i | 
| b | f | j | 
| c | g | k | 
| d | h | l | 
I want to obtain this :
| C1 | C2 | C3 | C4 | 
|---|---|---|---|
| a | e | i | a b c d e f g h i j k l | 
| b | f | j | a b c d e f g h i j k l | 
| c | g | k | a b c d e f g h i j k l | 
| d | h | l | a b c d e f g h i j k l | 
Do you know how to do this in python Pyspark ? Thanks !
答案1
得分: 3
尝试使用window函数,然后使用array_sort,flatten函数以及collect_set函数。
示例:
from pyspark.sql.functions import *
from pyspark.sql import *
w=Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df2= df.withColumn("temp_c1",collect_set(col("c1")).over(w))\
  .withColumn("temp_c2",collect_set(col("c2")).over(w))\
  .withColumn("temp_c3",collect_set(col("c3")).over(w))\
  .withColumn("c4",array_join(array_distinct(sort_array(flatten(array(col("temp_c1"),col("temp_c2"),col("temp_c3"))))),'')).\
    drop(*['temp_c1','temp_c2','temp_c3'])
df2.show(10,False)
#+---+---+---+------------+
#|c1 |c2 |c3 |c4          |
#+---+---+---+------------+
#|b  |f  |j  |abcdefghijkl|
#|c  |g  |k  |abcdefghijkl|
#|d  |h  |l  |abcdefghijkl|
#|a  |e  |i  |abcdefghijkl|
#+---+---+---+------------+
英文:
Try with window function and then use array_sort,flatten functions with collect_set.
Example:
from pyspark.sql.functions import *
from pyspark.sql import *
w=Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df2= df.withColumn("temp_c1",collect_set(col("c1")).over(w))\
  .withColumn("temp_c2",collect_set(col("c2")).over(w))\
  .withColumn("temp_c3",collect_set(col("c3")).over(w))\
  .withColumn("c4",array_join(array_distinct(sort_array(flatten(array(col("temp_c1"),col("temp_c2"),col("temp_c3"))))),'')).\
    drop(*['temp_c1','temp_c2','temp_c3'])
df2.show(10,False)
#+---+---+---+------------+
#|c1 |c2 |c3 |c4          |
#+---+---+---+------------+
#|b  |f  |j  |abcdefghijkl|
#|c  |g  |k  |abcdefghijkl|
#|d  |h  |l  |abcdefghijkl|
#|a  |e  |i  |abcdefghijkl|
#+---+---+---+------------+
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论