How to use a Spark UDF that takes an array of struct as an argument in order to build a new column?
The goal is to create a column named "subscriberPresent" based on the content of the "id" and "profile" fields, using the condition given in the question below. One way to achieve this without a UDF is to use Spark SQL's built-in functions: when and otherwise, combined with the higher-order exists function (available in the Scala API since Spark 3.0). Here's how you can do it:

import org.apache.spark.sql.functions._
val result = df.withColumn("subscriberPresent",
  when(exists(col("contractArray"), x => x.getField("profile") === "SUBSCRIBER" && x.getField("id").isNotNull), true).otherwise(false))

This code adds the "subscriberPresent" column to your DataFrame based on the condition you provided.
Please note that "profile" and "id" are fields of the structs inside the "contractArray" array, so a plain reference such as col("contractArray.profile") would return an array of all profile values rather than a single string; that is why the condition is evaluated per element with exists. You might need to adjust these references if your DataFrame structure is different.
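Since exists already returns a boolean, the when/otherwise wrapper above is optional. A slightly shorter, equivalent sketch under the same assumptions (result2 is just an illustrative name):

import org.apache.spark.sql.functions.{col, exists}
val result2 = df.withColumn("subscriberPresent",
  exists(col("contractArray"), x => x.getField("profile") === "SUBSCRIBER" && x.getField("id").isNotNull))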
Question
I have the following DataFrame df:
df.printSchema()
root
|-- code: string (nullable = true)
|-- contractId: string (nullable = true)
|-- contractArray: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- profile: string (nullable = true)
| | |-- id: string (nullable = true)
df.show()
+----+----------+--------------------------------------+
|code|contractId|                         contractArray|
+----+----------+--------------------------------------+
|   A|      45 8|  [{CONSUMER, 789}, {SUBSCRIBER, 789}]|
|  AC|    7896 0|                    [{CONSUMER, null}]|
|  BB|      12 7|[{CONSUMER, null}, {SUBSCRIBER, null}]|
| CCC|     753 8|[{SUBSCRIBER, null}, {CONSUMER, 7854}]|
+----+----------+--------------------------------------+
The goal is to create a column named subscriberPresent which would contain a boolean based on the content of the id and profile fields. The condition that the contents of the subscriberPresent column must respect is:

if (col("profile") === "SUBSCRIBER" && col("id").isNotNull) true
else false
So, the desired result is the following:
+----+----------+--------------------------------------+-----------------+
|code|contractId|                         contractArray|subscriberPresent|
+----+----------+--------------------------------------+-----------------+
|   A|      45 8|  [{CONSUMER, 789}, {SUBSCRIBER, 789}]|             true|
|  AC|    7896 0|                    [{CONSUMER, null}]|            false|
|  BB|      12 7|[{CONSUMER, null}, {SUBSCRIBER, null}]|            false|
| CCC|     753 8|[{SUBSCRIBER, null}, {CONSUMER, 7854}]|            false|
+----+----------+--------------------------------------+-----------------+
I was thinking of writing a UDF to handle this case, but there may be another way to achieve it. Do you have any suggestions?
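For anyone who wants to try the answers locally, a DataFrame matching the schema above can be rebuilt roughly as follows. This is only a sketch: the Contract case class is an illustrative name, and an existing SparkSession named spark is assumed.

import spark.implicits._  // assumes a SparkSession already available as 'spark'

// Struct elements of contractArray; field names taken from the schema in the question
case class Contract(profile: String, id: String)

val df = Seq(
  ("A",   "45 8",   Seq(Contract("CONSUMER", "789"),  Contract("SUBSCRIBER", "789"))),
  ("AC",  "7896 0", Seq(Contract("CONSUMER", null))),
  ("BB",  "12 7",   Seq(Contract("CONSUMER", null),   Contract("SUBSCRIBER", null))),
  ("CCC", "753 8",  Seq(Contract("SUBSCRIBER", null), Contract("CONSUMER", "7854")))
).toDF("code", "contractId", "contractArray")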
Answer 1
Score: 1
You can achieve the desired result without using a UDF by using the built-in higher-order exists function in Spark, invoked here through a SQL expression with expr.
Try out this code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Create a SparkSession (if you haven't already)
val spark = SparkSession.builder().getOrCreate()

// Assuming your DataFrame is already defined as 'df'
val resultDF = df.withColumn(
  "subscriberPresent",
  expr("exists(contractArray, x -> x.profile = 'SUBSCRIBER' and x.id is not null)")
)

resultDF.show()
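For completeness, the UDF route that the question title asks about would look roughly like the sketch below. It assumes the Scala API, where elements of an array-of-struct column are passed to a UDF as org.apache.spark.sql.Row objects; hasSubscriber and resultWithUdf are just illustrative names. Built-in functions such as exists are generally preferable because they avoid UDF serialization overhead and remain visible to the optimizer.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// True if any element of the array has profile == "SUBSCRIBER" and a non-null id
val hasSubscriber = udf { contracts: Seq[Row] =>
  contracts != null && contracts.exists { c =>
    c.getAs[String]("profile") == "SUBSCRIBER" && c.getAs[String]("id") != null
  }
}

val resultWithUdf = df.withColumn("subscriberPresent", hasSubscriber(col("contractArray")))
resultWithUdf.show()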