
Take a Spark DataFrame and collect all rows into one single row

Question

Is there a way to take a relational Spark DataFrame like the data below:

```python
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    ["id", "label"],
)

df.show()
```

And collect all of the values (I don't care about the column names) into one column, so it looks like the example below:

```python
new_df = spark.createDataFrame(["1 foo 2 bar"], "string").toDF("new_column")
new_df.show()
```

I do need to keep the order, so it has to be a string of '1 foo 2 bar', and not '1 2 foo bar' for example.

Is there a way to do this?
Thanks

Answer 1

Score: 1

Yes, try the concat_ws() and collect_list() + array_join() functions.

Example:

```python
from pyspark.sql.functions import *

df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

# Build one space-separated string per row, then group everything into a
# single row and join the collected strings with spaces.
df.withColumn("temp", concat_ws(" ", *df.columns)) \
  .groupBy(lit(1)) \
  .agg(array_join(collect_list(col("temp")), " ").alias("new_column")) \
  .drop("1") \
  .show(10, False)
#+-----------+
#|new_column |
#+-----------+
#|1 foo 2 bar|
#+-----------+
```
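
One caveat given the ordering requirement in the question: collect_list() does not guarantee row order once a shuffle is involved, so at scale this could just as well produce '2 bar 1 foo'. Below is a minimal order-preserving sketch (not from the original answer), assuming monotonically_increasing_id() reflects the input row order; the names rid, pairs, and result are arbitrary:

```python
from pyspark.sql import functions as F

# Tag every row with an increasing id, collect (id, text) structs, sort the
# collected array by id, then join the text fields back into one string.
result = (
    df.withColumn("rid", F.monotonically_increasing_id())
      .withColumn("temp", F.concat_ws(" ", "id", "label"))
      .agg(F.collect_list(F.struct("rid", "temp")).alias("pairs"))
      .select(
          F.array_join(
              F.expr("transform(array_sort(pairs), x -> x.temp)"), " "
          ).alias("new_column")
      )
)
result.show(truncate=False)  # |1 foo 2 bar|
```

Sorting inside the aggregate (array_sort on the collected structs) rather than relying on collect_list's arrival order is what makes the result deterministic.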

Answer 2

Score: 0

Try this (note: the lead() trick as written stitches together exactly two rows; with more rows it would emit one concatenated pair per adjacent row, not a single row):

```python
# df must be registered as a temp view before spark.sql() can query it
df.createOrReplaceTempView("df")

df2 = spark.sql("select id, label, lead(id) over (order by id) as id_1, lead(label) over (order by id) as label_2 from df")
df2.createOrReplaceTempView("df2")
df3 = spark.sql("select concat(concat(id, ' ', label), ' ', concat(id_1, ' ', label_2)) as one_col from df2 where id_1 is not null")
df3.show()
#+-----------+
#|    one_col|
#+-----------+
#|1 foo 2 bar|
#+-----------+
```
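
For a small DataFrame, a simpler driver-side alternative (a sketch, not from either answer) is to collect the rows and build the string in plain Python. This sidesteps the window/aggregation machinery entirely, as long as collect() returns rows in the original order, which holds for a locally created DataFrame like this one:

```python
# Pull all rows to the driver and flatten every value into one string;
# only appropriate when the DataFrame is small enough to collect.
one_string = " ".join(str(value) for row in df.collect() for value in row)

new_df = spark.createDataFrame([one_string], "string").toDF("new_column")
new_df.show(truncate=False)  # |1 foo 2 bar|
```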

Posted by huangapple on 2023-05-18 06:11:10.
Source: https://go.coder-hub.com/76276506.html