How to use a Spark UDF that takes an array of struct as an argument in order to build a new column?

Question

The goal is to create a column named "subscriberPresent" from the "profile" and "id" fields nested in the "contractArray" column. This can be done without a UDF, using Spark SQL's built-in functions. Note that when and otherwise applied directly to col("contractArray.profile") will not work here: on an array of structs that reference yields an array of strings, so comparing it with "SUBSCRIBER" fails with a data type mismatch. The condition has to be evaluated per array element instead, for example with the exists higher-order function (Spark 2.4 or later):

import org.apache.spark.sql.functions.expr

val result = df.withColumn("subscriberPresent", expr("exists(contractArray, x -> x.profile = 'SUBSCRIBER' and x.id is not null)"))

This adds a "subscriberPresent" column that is true when at least one element of "contractArray" has the profile SUBSCRIBER and a non-null id.

The field references (x.profile and x.id) assume that "profile" and "id" are fields of the structs inside "contractArray". Adjust them if your DataFrame structure is different.


I have the following df DataFrame:

df.printSchema()
root
 |-- code: string (nullable = true)
 |-- contractId: string (nullable = true)
 |-- contractArray: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- profile: string (nullable = true)
 |    |    |-- id: string (nullable = true)

df.show()
+----+----------+----------------------------------------+
|code|contractId|                           contractArray|
+----+----------+----------------------------------------+
|   A|      45 8|    [{CONSUMER, 789}, {SUBSCRIBER, 789}]|
|  AC|    7896 0|                      [{CONSUMER, null}]|
|  BB|      12 7|  [{CONSUMER, null}, {SUBSCRIBER, null}]|
| CCC|     753 8|  [{SUBSCRIBER, null}, {CONSUMER, 7854}]|
+----+----------+----------------------------------------+
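
To reproduce the DataFrame shown above, something like the following sketch should work. The Contract case class and the SampleData object are hypothetical names introduced only for this illustration, and the nullability flags Spark infers this way may differ from the printSchema() output in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case class mirroring one struct element of contractArray.
// Field order (profile, id) follows the schema printed in the question.
case class Contract(profile: String, id: String)

object SampleData {
  // Builds a DataFrame with the four rows from the question.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("A", "45 8", Seq(Contract("CONSUMER", "789"), Contract("SUBSCRIBER", "789"))),
      ("AC", "7896 0", Seq(Contract("CONSUMER", null))),
      ("BB", "12 7", Seq(Contract("CONSUMER", null), Contract("SUBSCRIBER", null))),
      ("CCC", "753 8", Seq(Contract("SUBSCRIBER", null), Contract("CONSUMER", "7854")))
    ).toDF("code", "contractId", "contractArray")
  }
}
```

With an existing session, val df = SampleData.build(spark) gives a DataFrame that can be used with the snippets in this thread.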

The goal is to create a column named subscriberPresent containing a boolean derived from the profile and id fields of the contractArray elements. The condition for subscriberPresent, checked against each element, is:

if(col("profile") === "SUBSCRIBER" && col("id") != null) true
else false
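
Read per row, the condition asks whether at least one array element satisfies it, which is an "exists" over the array. A plain Scala model of that reading, with no Spark involved, might look like this. ContractEntry and SubscriberCheck are hypothetical names, and Option[String] stands in for the nullable id.

```scala
// Hypothetical stand-in for one element of contractArray.
final case class ContractEntry(profile: String, id: Option[String])

object SubscriberCheck {
  // True when at least one entry is a SUBSCRIBER with a non-null id.
  def subscriberPresent(contracts: Seq[ContractEntry]): Boolean =
    contracts.exists(c => c.profile == "SUBSCRIBER" && c.id.isDefined)
}
```

Under this reading a row whose only SUBSCRIBER entry has a null id evaluates to false, even if another entry in the same array has an id.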

So, the desired result is the following:

+----+----------+----------------------------------------+-----------------+
|code|contractId|                           contractArray|subscriberPresent|
+----+----------+----------------------------------------+-----------------+
|   A|      45 8|    [{CONSUMER, 789}, {SUBSCRIBER, 789}]|             true|
|  AC|    7896 0|                      [{CONSUMER, null}]|            false|
|  BB|      12 7|  [{CONSUMER, null}, {SUBSCRIBER, null}]|            false|
| CCC|     753 8|  [{SUBSCRIBER, null}, {CONSUMER, 7854}]|            false|
+----+----------+----------------------------------------+-----------------+

I was thinking of writing a UDF to handle this case, but there may be another way to achieve it. Do you have any suggestions?
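
For reference, a UDF along those lines could be sketched as follows. It relies on Spark handing an array-of-struct column to a Scala UDF as a sequence of Row objects. The names SubscriberUdf, hasSubscriber and withSubscriberPresent are hypothetical, and this is only one possible shape, not a recommended implementation.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}

object SubscriberUdf {
  // An array<struct> column reaches a Scala UDF as a sequence of Row.
  // scala.collection.Seq is used so the signature fits Scala 2.12 and 2.13.
  val hasSubscriber = udf { (contracts: scala.collection.Seq[Row]) =>
    contracts != null && contracts.exists { c =>
      // Struct fields are read by name; a null profile simply compares unequal.
      c.getAs[String]("profile") == "SUBSCRIBER" && c.getAs[String]("id") != null
    }
  }

  // Adds the boolean subscriberPresent column computed by the UDF.
  def withSubscriberPresent(df: DataFrame): DataFrame =
    df.withColumn("subscriberPresent", hasSubscriber(col("contractArray")))
}
```

A UDF like this is opaque to Spark's optimizer, which is one reason to prefer built-in functions when they can express the same condition.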

Answer 1

Score: 1

You can achieve the desired result without a UDF by using Spark's built-in
exists higher-order function.

Try out this code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Create a SparkSession (if you haven't already)
val spark = SparkSession.builder().getOrCreate()

// Assuming your DataFrame is already defined as 'df'
val resultDF = df.withColumn(
  "subscriberPresent",
  expr("exists(contractArray, x -> x.profile = 'SUBSCRIBER' and x.id is not null)")
)

resultDF.show()
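
The same check can also be written with the typed Column API instead of a SQL string. This is a sketch that assumes Spark 3.0 or later, where functions.exists accepts a Scala lambda. SubscriberExists and withSubscriberPresent are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, exists}

object SubscriberExists {
  // Adds subscriberPresent using the Column-based exists function (Spark 3.0+).
  def withSubscriberPresent(df: DataFrame): DataFrame =
    df.withColumn(
      "subscriberPresent",
      exists(
        col("contractArray"),
        // c is a Column representing one struct element of the array.
        c => c.getField("profile") === "SUBSCRIBER" && c.getField("id").isNotNull
      )
    )
}
```

If profile itself can be null, the predicate may evaluate to null for that element, so the resulting column may contain null rather than false for such rows. Wrapping the expression in coalesce with lit(false) is one way to force a strict boolean.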

Posted by huangapple on 2023-06-29 21:21:13
Original link: https://go.coder-hub.com/76581480.html