January 6, 2020, 15:05:56

Convert an Array column to Array of Structs in PySpark dataframe

Question

I have a DataFrame containing 3 columns:

| str1      | array_of_str1        | array_of_str2  |
+-----------+----------------------+----------------+
| John      | [Size, Color]        | [M, Black]     |
| Tom       | [Size, Color]        | [L, White]     |
| Matteo    | [Size, Color]        | [M, Red]       |

I want to add an array column whose elements are structs combining the 3 columns:

| str1      | array_of_str1        | array_of_str2  | concat_result                                 |
+-----------+----------------------+----------------+-----------------------------------------------+
| John      | [Size, Color]        | [M, Black]     | [[[John, Size, M], [John, Color, Black]]]     |
| Tom       | [Size, Color]        | [L, White]     | [[[Tom, Size, L], [Tom, Color, White]]]       |
| Matteo    | [Size, Color]        | [M, Red]       | [[[Matteo, Size, M], [Matteo, Color, Red]]]   |

Answer 1

Score: 9

If the number of elements in the arrays is fixed, this is quite straightforward using the array and struct functions. Here is a bit of code in Scala:

import org.apache.spark.sql.functions.{array, col, struct}

// Build one struct per array position (0 and 1) and collect them into an array.
val result = df
    .withColumn("concat_result", array((0 to 1).map(i => struct(
                     col("str1"),
                     col("array_of_str1").getItem(i),
                     col("array_of_str2").getItem(i)
    )) : _*))

And in Python, since you were asking about PySpark:

import pyspark.sql.functions as F

# Same idea: one struct per array position, collected into an array column.
df.withColumn("concat_result", F.array(*[F.struct(
                  F.col("str1"),
                  F.col("array_of_str1").getItem(i),
                  F.col("array_of_str2").getItem(i))
              for i in range(2)]))

And you get the following schema:

root
 |-- str1: string (nullable = true)
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- concat_result: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)
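To make the fixed-size approach concrete, here is a plain-Python sketch of what it computes for a single row. It does not use Spark at all. The helper names get_item and concat_result_fixed are hypothetical, and returning None for an out-of-range index is an assumption that mirrors getItem in Spark's default (non-ANSI) mode.

```python
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, Optional[str], Optional[str]]


def get_item(values: Sequence[str], i: int) -> Optional[str]:
    """Model of getItem(i) on an array: None when the index is out of range
    (assumption: Spark's default, non-ANSI behavior)."""
    return values[i] if 0 <= i < len(values) else None


def concat_result_fixed(
    str1: str, keys: Sequence[str], values: Sequence[str], n: int = 2
) -> List[Triple]:
    """Build exactly n triples, one per array position 0..n-1."""
    return [(str1, get_item(keys, i), get_item(values, i)) for i in range(n)]


# Row from the question: both arrays have exactly 2 elements.
print(concat_result_fixed("John", ["Size", "Color"], ["M", "Black"]))

# A shorter array is padded with None; elements past position n-1 are dropped.
print(concat_result_fixed("Tom", ["Size"], ["L", "White"]))
print(concat_result_fixed("Matteo", ["Size", "Color", "Fit"], ["M", "Red", "Slim"]))
```

The last two calls show why this approach only suits arrays of known, fixed length: the range is hard-coded, so longer arrays are silently truncated.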

Answer 2

Score: 0

Spark >= 2.4.x

For arrays whose length is not known in advance, you can use higher-order functions:

import pyspark.sql.functions as f

expr = "TRANSFORM(arrays_zip(array_of_str1, array_of_str2), x -> struct(str1, concat(x.array_of_str1), concat(x.array_of_str2)))"
df = df.withColumn('concat_result', f.expr(expr))

df.show(truncate=False)

Schema and output:

root
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str1: string (nullable = true)
 |-- concat_result: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)

+-------------+-------------+------+-----------------------------------------+
|array_of_str1|array_of_str2|str1  |concat_result                            |
+-------------+-------------+------+-----------------------------------------+
|[Size, Color]|[M, Black]   |John  |[[John, Size, M], [John, Color, Black]]  |
|[Size, Color]|[L, White]   |Tom   |[[Tom, Size, L], [Tom, Color, White]]    |
|[Size, Color]|[M, Red]     |Matteo|[[Matteo, Size, M], [Matteo, Color, Red]]|
+-------------+-------------+------+-----------------------------------------+
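If more than two array columns need to be zipped, the SQL expression can be generated instead of typed by hand. The following is only a sketch: build_zip_expr is a hypothetical helper, it assumes plain column names that need no backtick quoting, and it omits the single-argument concat() wrappers used above, so the struct field names in the result may differ from the col2/col3 shown in the schema.

```python
from typing import Sequence


def build_zip_expr(key_col: str, array_cols: Sequence[str]) -> str:
    """Return a Spark SQL expression that zips the array columns and
    prepends the scalar key column to every resulting struct."""
    if not array_cols:
        raise ValueError("at least one array column is required")
    zipped = ", ".join(array_cols)
    # arrays_zip names each struct field after its source column,
    # so the lambda can refer to them as x.<column name>.
    fields = ", ".join([key_col] + [f"x.{c}" for c in array_cols])
    return f"TRANSFORM(arrays_zip({zipped}), x -> struct({fields}))"


expr = build_zip_expr("str1", ["array_of_str1", "array_of_str2"])
print(expr)

# Intended use with PySpark (not executed in this sketch):
#   import pyspark.sql.functions as f
#   df = df.withColumn("concat_result", f.expr(expr))
```

Generating the string keeps the list of array columns in one place, which helps when the same transformation has to be applied to several DataFrames with different column sets.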
