How to use a PySpark UDF in a Java / Scala Spark project


Question


There are a lot of questions like "How to call Java code from PySpark", but none about calling Python code from a Java Spark project. This is useful for big, old Java projects that need functionality that was implemented in Python.

Answer 1

Score: 2

I've also shared the answer in my [Medium][1] post.

As you know, Apache Spark is written in Scala. PySpark is not a separate, fully Python project. There is an `org.apache.spark.deploy.PythonRunner` class that:

  • creates a Py4J server
  • exports the Py4J server's host, port, and secret as environment variables
  • launches the provided Python script
  • waits for the script's process to finish

In turn, when a Spark context is created in the Python script, it connects to the Py4J server using the credentials from those environment variables. Py4J allows you to use any JVM object via the Java Reflection API. In other words, PySpark is a wrapper around the Java Spark context.
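
As a minimal sketch (not part of the original answer), this is roughly what the Python side sees when it is launched by `PythonRunner`: the exported variables are picked up by `pyspark.java_gateway.launch_gateway()`, which then attaches to the already running JVM instead of starting a new one. The variable names `PYSPARK_GATEWAY_PORT` and `PYSPARK_GATEWAY_SECRET` are those used by recent Spark releases and may differ in older ones.

#!/usr/bin/env python
# Minimal sketch: run it through PythonRunner as in the Java example below.
# If launched standalone, launch_gateway() starts its own JVM via spark-submit instead.
import os

import pyspark

# Exported by org.apache.spark.deploy.PythonRunner before the script starts
# (PYSPARK_GATEWAY_SECRET, the auth token, is exported alongside it):
print("Py4J gateway port:", os.environ.get("PYSPARK_GATEWAY_PORT"))

# launch_gateway() picks up those variables and attaches to the existing JVM.
gateway = pyspark.java_gateway.launch_gateway()

# Any JVM class is now reachable through the Java Reflection API, e.g.:
print(gateway.jvm.java.lang.System.getProperty("java.version"))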

Here is an example of a simple Java app that uses Apache Spark's `PythonRunner`:

package example.python;

import org.apache.spark.deploy.PythonRunner;
import org.apache.spark.sql.SparkSession;

public class Main {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Shared Spark Context Example")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

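        // PythonRunner.main(args): args[0] is the Python script to run and
        // args[1] is the comma-separated "pyFiles" list (here the script itself);
        // any further elements are passed to the script as arguments.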
        PythonRunner.main(new String[]{
                "src/main/python/example.py",
                "src/main/python/example.py"
        });

        spark.stop();
    }
}

But if you try to initialize a Spark session in example.py, you will get an exception: there can be only one SparkContext in a JVM process. So the first question is: how do we pass the existing Java SparkContext to PySpark? And the next question is: how do we share a DataFrame with PySpark?

To share an existing SparkContext, we need to connect to the JVM over the Py4J gateway, expose an instance of `org.apache.spark.api.java.JavaSparkContext` through a public static variable, and initialize `pyspark.conf.SparkConf` from `JavaSparkContext#getConf()`.

A DataFrame can be shared using Spark's temporary view functionality.

Here is the updated Java code:

package example.python;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.deploy.PythonRunner;
import org.apache.spark.sql.SparkSession;

public class Main {

    public static JavaSparkContext jsc;

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Spark Python Runner")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");
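        // Expose the JavaSparkContext through the public static field so the
        // Python script can pick it up over the Py4J gateway.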
        jsc = new JavaSparkContext(spark.sparkContext());

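        // Share the input DataFrame with the Python script via a temporary view.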
        var df = spark.read().textFile("src/main/resources/dataset.txt");
        df.createOrReplaceTempView("tbl");

        PythonRunner.main(new String[]{
                "src/main/python/example.py",
                "src/main/python/example.py"
        });

        spark.sql("SELECT * FROM tbl").show();

        spark.stop();
    }
}

And the Python code:

#!/usr/bin/env python
# coding: utf-8
import sys

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, udf
from pyspark.sql.types import StringType

if __name__ == "__main__":
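    # Attach to the JVM that launched this script via PythonRunner, grab the
    # JavaSparkContext exposed by the Java side as a public static field, and
    # wrap it in a Python SparkContext/SparkSession instead of creating new ones.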
    gateway = pyspark.java_gateway.launch_gateway()
    jsc = gateway.jvm.example.python.Main.jsc
    conf = pyspark.conf.SparkConf(True, gateway.jvm, jsc.getConf())

    sc = pyspark.SparkContext(gateway=gateway, jsc=jsc, conf=conf)
    spark = SparkSession(sc)

    df = spark.sql("SELECT * FROM tbl")

    df = df.withColumn("len", length('value').alias('len'))

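    # Publish the result under the same temp view name so the Java side sees it.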
    df.createOrReplaceTempView("tbl")

    sys.exit(0)

Going further, you can register a Python UDF in PySpark and then call it from Java code.

Python:

# ...
py_concat_of2_udf = udf(lambda x, y: str(x) + str(y), StringType())
spark.udf.register("py_concat_of2", py_concat_of2_udf)
# ...

Java:

// ...
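// requires static imports, e.g.:
// import static org.apache.spark.sql.functions.callUDF;
// import static org.apache.spark.sql.functions.col;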
spark.sql("SELECT * FROM tbl")
        .withColumn("pyfunc", callUDF("py_concat_of2", col("value"), col("len")))
        .show();
// ...

The stdout of the code:

+----------+---+------------+
|     value|len|      pyfunc|
+----------+---+------------+
|       one|  3|        one3|
|       two|  3|        two3|
|three four| 10|three four10|
|      five|  4|       five4|
+----------+---+------------+

How does it work? There is an `org.apache.spark.sql.catalyst.expressions.PythonUDF` Scala class that contains an `org.apache.spark.api.python.PythonFunction` object. That object holds a `command: Seq[Byte]` field, which is actually the Python lambda serialized with Pickle.
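
As a rough, conceptual sketch (not part of the original answer): the payload is essentially a pickled Python closure. The snippet below uses the standalone cloudpickle package (PySpark bundles its own vendored copy and wraps the function with extra metadata, so the bytes Spark actually stores differ); it only illustrates that a lambda round-trips through this kind of serialization.

import cloudpickle  # standalone package; PySpark ships a vendored copy internally

# The same lambda that was registered above as py_concat_of2.
py_concat_of2 = lambda x, y: str(x) + str(y)

command = cloudpickle.dumps(py_concat_of2)  # opaque bytes, conceptually like PythonFunction.command
restored = cloudpickle.loads(command)

print(len(command))        # size of the serialized closure
print(restored("one", 3))  # -> "one3"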

The downside of this approach is that every action produces a stack trace in stdout:

ERROR DAGScheduler: Failed to update accumulator 37 (org.apache.spark.api.python.PythonAccumulatorV2) for task 0
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.connect0(Native Method)
at java.base/sun.nio.ch.Net.connect(Net.java:579)
at java.base/sun.nio.ch.Net.connect(Net.java:568)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at java.base/java.net.Socket.connect(Socket.java:583)
at java.base/java.net.Socket.<init>(Socket.java:507)
at java.base/java.net.Socket.<init>(Socket.java:287)
at org.apache.spark.api.python.PythonAccumulatorV2.openSocket(PythonRDD.scala:701)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:723)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1(DAGScheduler.scala:1610)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1$adapted(DAGScheduler.scala:1601)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1749)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2857)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

The `org.apache.spark.api.python.PythonAccumulatorV2` object is created by `pyspark.SparkContext` and is used for Apache Spark metrics.

[1]: https://medium.com/@a.iatsuk/apache-spark-context-sharing-across-jvm-and-python-processes-aadde34872bc
