Convert an Array column to Array of Structs in PySpark dataframe
Question
I have a DataFrame containing 3 columns:
| str1   | array_of_str1 | array_of_str2 |
+--------+---------------+---------------+
| John   | [Size, Color] | [M, Black]    |
| Tom    | [Size, Color] | [L, White]    |
| Matteo | [Size, Color] | [M, Red]      |
I want to add an array column that combines the 3 columns into an array of structs:
| str1   | array_of_str1 | array_of_str2 | concat_result                             |
+--------+---------------+---------------+-------------------------------------------+
| John   | [Size, Color] | [M, Black]    | [[John, Size, M], [John, Color, Black]]   |
| Tom    | [Size, Color] | [L, White]    | [[Tom, Size, L], [Tom, Color, White]]     |
| Matteo | [Size, Color] | [M, Red]      | [[Matteo, Size, M], [Matteo, Color, Red]] |
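For reference, here is a minimal snippet that builds a sample DataFrame with this shape (an illustrative sketch, assuming an existing SparkSession bound to the name spark):

# Hypothetical sample data matching the tables above.
df = spark.createDataFrame(
    [
        ("John",   ["Size", "Color"], ["M", "Black"]),
        ("Tom",    ["Size", "Color"], ["L", "White"]),
        ("Matteo", ["Size", "Color"], ["M", "Red"]),
    ],
    ["str1", "array_of_str1", "array_of_str2"],
)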
Answer 1
Score: 9
If the number of elements in the arrays is fixed, it is quite straightforward using the array and struct functions. Here is a bit of code in Scala:
import org.apache.spark.sql.functions.{array, col, struct}

// Build one struct per array index (0 and 1 here, since the arrays
// have two elements) and collect them into a new array column.
val result = df
  .withColumn("concat_result", array((0 to 1).map(i => struct(
    col("str1"),
    col("array_of_str1").getItem(i),
    col("array_of_str2").getItem(i)
  )) : _*))
And in Python, since you were asking about PySpark:
import pyspark.sql.functions as F

# Same idea: one struct per index, gathered into an array column.
# range(2) matches the fixed length of the input arrays.
df.withColumn("concat_result", F.array(*[F.struct(
    F.col("str1"),
    F.col("array_of_str1").getItem(i),
    F.col("array_of_str2").getItem(i))
    for i in range(2)]))
And you get the following schema:
root
|-- str1: string (nullable = true)
|-- array_of_str1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- array_of_str2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- concat_result: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- str1: string (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: string (nullable = true)
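Note that col2 and col3 in concat_result are auto-generated field names. If you prefer readable ones, each entry inside the struct can be aliased. This variant is a sketch; the field names attribute and value are arbitrary choices, not from the original answer:

import pyspark.sql.functions as F

# Aliasing inside struct() controls the generated field names, so the
# schema reads str1/attribute/value instead of str1/col2/col3.
df.withColumn("concat_result", F.array(*[F.struct(
    F.col("str1"),
    F.col("array_of_str1").getItem(i).alias("attribute"),
    F.col("array_of_str2").getItem(i).alias("value"))
    for i in range(2)]))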
Answer 2
Score: 0
Spark >= 2.4.x
For a dynamic number of array elements you can use higher-order functions:
import pyspark.sql.functions as f

# arrays_zip pairs the two arrays element-wise; TRANSFORM then maps each
# zipped element x to a struct of (str1, element of array 1, element of array 2).
expr = "TRANSFORM(arrays_zip(array_of_str1, array_of_str2), x -> struct(str1, concat(x.array_of_str1), concat(x.array_of_str2)))"
df = df.withColumn('concat_result', f.expr(expr))
df.show(truncate=False)
Schema and output:
root
|-- array_of_str1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- array_of_str2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str1: string (nullable = true)
|-- concat_result: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- str1: string (nullable = true)
| | |-- col2: string (nullable = true)
| | |-- col3: string (nullable = true)
+-------------+-------------+------+-----------------------------------------+
|array_of_str1|array_of_str2|str1 |concat_result |
+-------------+-------------+------+-----------------------------------------+
|[Size, Color]|[M, Black] |John |[[John, Size, M], [John, Color, Black]] |
|[Size, Color]|[L, White] |Tom |[[Tom, Size, L], [Tom, Color, White]] |
|[Size, Color]|[M, Red] |Matteo|[[Matteo, Size, M], [Matteo, Color, Red]]|
+-------------+-------------+------+-----------------------------------------+
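On PySpark 3.1+, the same higher-order transform is also available directly through the DataFrame API, which avoids embedding SQL in a string. A sketch under that version assumption (the struct field names attribute and value are again arbitrary):

import pyspark.sql.functions as f

# arrays_zip pairs the two arrays element-wise; transform maps each
# zipped element to a struct that also carries the str1 column.
df = df.withColumn(
    "concat_result",
    f.transform(
        f.arrays_zip("array_of_str1", "array_of_str2"),
        lambda x: f.struct(
            f.col("str1"),
            x["array_of_str1"].alias("attribute"),
            x["array_of_str2"].alias("value"),
        ),
    ),
)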