Concatenate all distinct values of several columns into one column in PySpark

Question

I have a dataset like this:

C1	C2	C3
a	e	i
b	f	j
c	g	k
d	h	l

I want to obtain this:

C1	C2	C3	C4
a	e	i	a b c d e f g h i j k l
b	f	j	a b c d e f g h i j k l
c	g	k	a b c d e f g h i j k l
d	h	l	a b c d e f g h i j k l

Do you know how to do this in PySpark? Thanks!

Answer 1

Score: 3

Try a window function with collect_set to gather the distinct values of each column, then combine the results with array, flatten, sort_array, array_distinct and array_join.

Example:

from pyspark.sql import Window
from pyspark.sql.functions import (array, array_distinct, array_join, col,
                                   collect_set, flatten, lit, sort_array)

# One window spanning the whole DataFrame (constant partition key, unbounded frame).
# Note: this moves every row into a single partition.
w = Window.partitionBy(lit(1)).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# Collect the distinct values of each column over the whole window.
df2 = (df.withColumn("temp_c1", collect_set(col("c1")).over(w))
  .withColumn("temp_c2", collect_set(col("c2")).over(w))
  .withColumn("temp_c3", collect_set(col("c3")).over(w))
  # Merge the three arrays, sort, deduplicate, and join with a space as in the question.
  .withColumn("c4", array_join(array_distinct(sort_array(flatten(array(col("temp_c1"), col("temp_c2"), col("temp_c3"))))), " "))
  .drop("temp_c1", "temp_c2", "temp_c3"))

df2.show(10, False)
# Row order may vary between runs.
#+---+---+---+-----------------------+
#|c1 |c2 |c3 |c4                     |
#+---+---+---+-----------------------+
#|b  |f  |j  |a b c d e f g h i j k l|
#|c  |g  |k  |a b c d e f g h i j k l|
#|d  |h  |l  |a b c d e f g h i j k l|
#|a  |e  |i  |a b c d e f g h i j k l|
#+---+---+---+-----------------------+
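
To make the chain of functions easier to follow, here is a rough plain-Python sketch (no Spark involved) of what each step computes on the sample data from the question. The variable names are made up for illustration, and the space delimiter is an assumption taken from the desired output in the question; it only mirrors the logic and is not how Spark executes it.

```python
# Plain-Python illustration of the logic; the names here are hypothetical.
rows = [("a", "e", "i"), ("b", "f", "j"), ("c", "g", "k"), ("d", "h", "l")]

# collect_set per column: distinct values of c1, c2 and c3 over all rows
per_column = [set(column) for column in zip(*rows)]

# array + flatten: merge the three per-column collections into one list
flat = [value for column in per_column for value in column]

# sort_array + array_distinct: order the values and drop cross-column repeats
distinct_sorted = sorted(set(flat))

# array_join: concatenate with a delimiter (a space, as in the question)
c4 = " ".join(distinct_sorted)

# withColumn: every row receives the same c4 value
result = [row + (c4,) for row in rows]
for row in result:
    print(row)
```

Because each column is deduplicated on its own first, the array_distinct step only matters when the same value appears in more than one column.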
