How to call a Spark Java UDF in PySpark without using SQL?

Question

Let's say I have implemented a UDF in Java:

package io.test;

import org.apache.spark.sql.api.java.UDF1;

public class TestUDF implements UDF1<Integer, Integer> {

  @Override
  public Integer call(Integer i) throws Exception {
    // Some operations; return the (possibly transformed) input
    return i;
  }

}

I am using PySpark, so I can register the UDF in my Python driver script as follows:

spark_session.udf.registerJavaFunction("test_udf", "io.test.TestUDF", IntegerType())

This UDF is now available to me in SQL queries in PySpark, e.g.:

df.createOrReplaceTempView("MyTable")
df2 = spark_session.sql("select test_udf(my_col) as mapped from MyTable")

However, I am wondering whether there is a non-SQL way of achieving this in PySpark, e.g. something like:

df.withColumn("mapped", functions.callUDF("test_udf", df.col("my_col")))

I checked and found that the callUDF() function is available in the Spark Java API, but not in PySpark.


Answer 1

Score: 0

Update:

To use a UDF implemented in Java, call it through expr:

.withColumn("asStr", F.expr("asStr(Number)"))

Alternatively, if the function is written in Python, you can convert it to a UDF using functions.udf like this:

TestUDF = udf(TestUDFFunction, IntegerType())

and use it like this:

myDF.withColumn("mapped", TestUDF(F.col("my_col")))

Example:

Input:

(sample DataFrame with an integer Number column — the original table did not survive extraction)

UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def asStr(num):
  return "Number is " + str(num)

asStr_udf = F.udf(asStr, StringType())

df1.withColumn("asStr", asStr_udf(F.col("Number"))).show()

Output:

(rows of the form "Number is <value>" in the new asStr column — the original table did not survive extraction)


huangapple
  • This article was published on February 27, 2023 at 14:25:12
  • Please retain this link when reposting: https://go.coder-hub.com/75577322.html