Take a Spark DataFrame and collect all rows into one single row

Question

Is there a way to take a relational Spark DataFrame like the data below:

```python
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    ["id", "label"],
)

df.show()
```

And collect all of the values (I don't care about the column names) into one column, so it looks like the example below:

```python
new_df = spark.createDataFrame(["1 foo 2 bar"], "string").toDF("new_column")
new_df.show()
```
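
For reference, the show() output for the input and for the desired result (reconstructed from the snippets above) looks like this:

```python
# df.show()
# +---+-----+
# | id|label|
# +---+-----+
# |  1|  foo|
# |  2|  bar|
# +---+-----+

# new_df.show()
# +-----------+
# | new_column|
# +-----------+
# |1 foo 2 bar|
# +-----------+
```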

I do need to keep the order, so it has to be the string '1 foo 2 bar', and not '1 2 foo bar', for example.

Is there a way to do this?
Thanks

Answer 1

Score: 1

Yes, try the **`concat_ws()`** and **`collect_list()` + `array_join()`** functions.

**Example:**

```python
from pyspark.sql.functions import *

df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

df.withColumn("temp", concat_ws(" ", *df.columns)) \
    .groupBy(lit(1)) \
    .agg(array_join(collect_list(col("temp")), " ").alias("new_column")) \
    .drop("1") \
    .show(10, False)
# +-----------+
# |new_column |
# +-----------+
# |1 foo 2 bar|
# +-----------+
```

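One caveat about this aggregate approach: once Spark shuffles the data, collect_list() makes no hard guarantee about the order in which rows are collected, and the question explicitly requires '1 foo 2 bar'. Below is a minimal order-preserving sketch (my own variant, not part of the original answer; `ord` and `temp` are illustrative column names, and it assumes the input partitioning still reflects the original row order):

```python
from pyspark.sql import functions as F

# Tag each row with an increasing id *before* aggregating, then sort the
# collected (ord, temp) structs by ord and join the strings back together.
ordered = (
    df.withColumn("ord", F.monotonically_increasing_id())
      .withColumn("temp", F.concat_ws(" ", "id", "label"))
)

ordered.agg(
    F.expr(
        "array_join("
        "transform(array_sort(collect_list(struct(ord, temp))), x -> x.temp), ' ')"
    ).alias("new_column")
).show(truncate=False)
# +-----------+
# |new_column |
# +-----------+
# |1 foo 2 bar|
# +-----------+
```

On the two-row example this changes nothing, but on a real cluster the explicit sort key is what keeps the output deterministic.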

Answer 2

Score: 0

Try this:

```python
# The original answer assumes df is already registered as a temp view:
df.createOrReplaceTempView("df")

df2 = spark.sql("select id, label, lead(id) over (order by id) as id_1, lead(label) over (order by id) as label_2 from df")
df2.createOrReplaceTempView("df2")
df3 = spark.sql("select concat(concat(id, ' ', label), ' ', concat(id_1, ' ', label_2)) as one_col from df2 where id_1 is not null")
df3.show()
# +-----------+
# |    one_col|
# +-----------+
# |1 foo 2 bar|
# +-----------+
```
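
Note that the lead() approach is hard-wired to a two-row input: each output row pairs a row with its successor, so the `where id_1 is not null` filter leaves exactly one row only when the DataFrame has exactly two rows. A sketch of a pure-SQL version that handles any number of rows (my own generalization, not part of the original answer; it assumes Spark 2.4+ for transform/array_join and the `df` temp view registered above):

```python
# Collect every row into one sorted array of (ord, line) structs,
# keep only the formatted line, and join the pieces with spaces.
one_row = spark.sql("""
    select array_join(
             transform(
               sort_array(collect_list(named_struct('ord', id,
                                                    'line', concat(id, ' ', label)))),
               x -> x.line),
             ' ') as one_col
    from df
""")
one_row.show(truncate=False)
# +-----------+
# |one_col    |
# +-----------+
# |1 foo 2 bar|
# +-----------+
```

Here sort_array orders the structs by the ord field, which plays the same role as the `order by id` in the window version.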
