How to create a Spark UDF that returns a Tuple or updates two columns at the same time?
Here's the modified code with the necessary changes to fix the error and achieve your expected resulting DataFrame:
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class UDFExample {

    public static void main(String[] args) {
        // Spark session
        SparkSession spark = SparkSession.builder()
                .appName("UDFExample")
                .config("spark.master", "local")
                .getOrCreate();

        // Define and register the UDF. Spark passes a struct column into a UDF as a Row,
        // not as scala.Tuple2, so the second parameter and the result are typed as Row.
        spark.udf().register("calculateScore", new UDF2<Integer, Row, Row>() {
            @Override
            public Row call(Integer feature, Row scoreValueAndReason) throws Exception {
                double newScoreValue = 0.5;
                String newScoreReason = "Null Issue";
                double scoreValue = scoreValueAndReason.getDouble(0);
                String scoreReason = scoreValueAndReason.getString(1);
                double updatedScoreValue = Math.min(scoreValue, newScoreValue);
                String updatedScoreReason = newScoreValue < scoreValue ? newScoreReason : scoreReason;
                return RowFactory.create(updatedScoreValue, updatedScoreReason);
            }
        }, DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));

        // Create DataFrame
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 2, 3, 4),
                RowFactory.create(2, 1, 4, 3),
                RowFactory.create(3, 4, 9, 2),
                RowFactory.create(4, 3, 2, 1)
        ), new StructType(new StructField[]{
                new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
        }));

        // Capture the original column names before adding ScoreValueAndReason,
        // so the loop below does not also try to rewrite the struct column itself.
        String[] columnNames = df.columns();

        df = df.withColumn("ScoreValueAndReason",
                functions.struct(functions.lit(1.0), functions.lit("No issues"))
        );
        df.show();

        // Apply the UDF to each original column
        for (String columnName : columnNames) {
            df = df.withColumn(columnName,
                    functions.callUDF("calculateScore", col(columnName), col("ScoreValueAndReason")));
        }
        df.show();
    }
}
I've made the following changes to your code:
- Registered the UDF inline with spark.udf().register(...), giving it an explicit struct return type.
- Declared the UDF's struct parameter and result as Row instead of scala.Tuple2: Spark hands struct columns to a UDF as a Row, which is what causes the "GenericRowWithSchema incompatible with scala.Tuple2" error. The fields are read with getDouble(0)/getString(1) and the result is built with RowFactory.create.
- Captured the column names before adding the ScoreValueAndReason column, so the loop only rewrites A, B, C and D.
- Applied the UDF to each of those columns inside the loop.
With these changes the code should run without the "org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema incompatible with scala.Tuple2" error and give you the desired resulting DataFrame with updated columns.
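As a possible follow-up (not part of the code above): if you later want to split one of these struct columns back into two flat columns, which is the "split into two in a subsequent step" mentioned in the question below, a minimal sketch using Column.getField could look like this. The column and field names are the ones used above; the A_scoreValue/A_scoreReason names are just illustrative:

// Split the struct produced for column "A" into two flat columns.
// getField("_1") / getField("_2") address the fields defined in the UDF's return schema.
df = df
        .withColumn("A_scoreValue", functions.col("A").getField("_1"))
        .withColumn("A_scoreReason", functions.col("A").getField("_2"))
        .drop("A");
df.show();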
Question
I'm trying to create a UDF in Java that iterates over all columns in the dataset, calculates a score for each column, and sets a flag with a specific string. I tried a UDF that updates two columns at the same time, but with no success. So my current approach is to have a single column composed of a tuple, which I plan to split into two columns in a subsequent step.
This is my attempt (example code):
public static void main(String[] args) {
    // Spark session
    SparkSession spark = SparkSession.builder()
            .appName("UDFExample")
            .config("spark.master", "local")
            .getOrCreate();

    UDF2<Integer, Tuple2<Double, String>, Tuple2<Double, String>> calculateScore = (feature, ScoreValueAndReason) -> {
        double new_ScoreValue = 0.5;
        String new_ScoreReason = "Null Issue";
        double ScoreValue = ScoreValueAndReason._1();
        String ScoreReason = ScoreValueAndReason._2();
        double updated_ScoreValue = Math.min(ScoreValue, new_ScoreValue);
        String updated_ScoreReason = new_ScoreValue < ScoreValue ? new_ScoreReason : ScoreReason;
        return new Tuple2<>(updated_ScoreValue, updated_ScoreReason);
    };

    Dataset<Row> df = spark.createDataFrame(Arrays.asList(
            RowFactory.create(1, 2, 3, 4),
            RowFactory.create(2, 1, 4, 3),
            RowFactory.create(3, 4, 9, 2),
            RowFactory.create(4, 3, 2, 1)
    ), new StructType(new StructField[]{
            new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
    }));

    df = df.withColumn("ScoreValueAndReason",
            functions.struct(functions.lit(1.0), functions.lit("No issues"))
    );
    df.show();

    // Register the UDF
    spark.udf().register("calculateScore", calculateScore, DataTypes.createStructType(new StructField[]{
            new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("_2", DataTypes.StringType, false, Metadata.empty())
    }));

    String[] columnNames = df.columns();
    for (String columnName : columnNames) {
        df = df.withColumn(columnName, functions.callUDF("calculateScore", col(columnName), col("ScoreValueAndReason")));
    }
    df.show();
}
I tried the implementation above as well as a slightly different UDF, but I keep running into this error: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema incompatible with scala.Tuple2
I expect a resulting DataFrame with either two new columns, scoreReason and scoreValue, or (more like in the example I provided) a single ScoreValueAndReason column that contains the values updated by my UDF.
Answer 1

Score: 0
See here.
Out of the box, UDFs don't support complex types such as Tuple2; a struct argument arrives as a Row, so you are forced to work with Row directly. For Scala there was https://github.com/gaborbarna/typedudf, and there is https://github.com/typelevel/frameless, which provide typed UDFs that derive and use encoders correctly so you don't have to handle Row yourself.
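To make the "forced to use Row" point concrete, here is a minimal sketch (an illustration added here, not code from the answer) of the plain-Row variant of the asker's UDF; the struct argument is typed as Row and read positionally:

// Hypothetical sketch: same logic as calculateScore, but the struct argument
// arrives as a Row, so it is read with getDouble(0)/getString(1),
// and the struct result is built with RowFactory.create.
UDF2<Integer, Row, Row> calculateScoreRow = (feature, scoreValueAndReason) -> {
    double scoreValue = scoreValueAndReason.getDouble(0);
    String scoreReason = scoreValueAndReason.getString(1);
    double newScoreValue = 0.5;
    return RowFactory.create(
            Math.min(scoreValue, newScoreValue),
            newScoreValue < scoreValue ? "Null Issue" : scoreReason);
};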
That said, if you can live without calling the UDF from SQL, you can define the following helper, which wires in encoders and works for any parameter types (tested on Spark 3.4.1; the ScalaUDF constructor may differ on other versions):
public <A, B, R> BiFunction<Column, Column, Column> withEncoders(UDF2<A, B, R> udf2, String name,
        Encoder<A> encA, Encoder<B> encB, Encoder<R> encR, DataType returnType) {
    DataType replaced = CharVarcharUtils.failIfHasCharVarchar(returnType);
    // Adapt the Java UDF2 to the scala.Function2 that ScalaUDF expects
    scala.Function2<A, B, R> func = new scala.runtime.AbstractFunction2<A, B, R>() {
        @Override
        public Object apply(Object v1, Object v2) {
            try {
                return ((UDF2<Object, Object, Object>) udf2).call(v1, v2);
            } catch (Exception e) {
                e.printStackTrace();
                return null;
            }
        }
    };
    // Each call wraps the two input Columns in a ScalaUDF expression that carries explicit
    // input/output encoders, so struct arguments arrive as Tuple2 rather than Row
    return (Column a, Column b) -> {
        ScalaUDF udf = new ScalaUDF(func, replaced,
                Seq$.MODULE$.<Expression>newBuilder().$plus$eq(a.expr()).$plus$eq(b.expr()).result(),
                Seq$.MODULE$.<scala.Option<ExpressionEncoder<?>>>newBuilder()
                        .$plus$eq(Some$.MODULE$.apply((ExpressionEncoder<A>) encA))
                        .$plus$eq(Some$.MODULE$.apply((ExpressionEncoder<B>) encB)).result(),
                Some$.MODULE$.apply((ExpressionEncoder<R>) encR), Some$.MODULE$.apply(name), true, true);
        return new Column(udf);
    };
}
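For reference, a best-effort list of the imports the helper above relies on (added here for convenience, not part of the original answer; in particular, the package of Seq$ depends on the Scala version of your Spark build):

import java.util.function.BiFunction;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.catalyst.expressions.ScalaUDF;
import org.apache.spark.sql.catalyst.util.CharVarcharUtils;
import org.apache.spark.sql.types.DataType;
import scala.Some$;
import scala.collection.Seq$; // on Scala 2.13 builds this may instead be scala.collection.immutable.Seq$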
Then:
UDF2<Integer, Tuple2<Double, String>, Tuple2<Double, String>> calculateScoreRaw = (feature, ScoreValueAndReason) -> {
    double new_ScoreValue = 0.5;
    String new_ScoreReason = "Null Issue";
    double ScoreValue = ScoreValueAndReason._1();
    String ScoreReason = ScoreValueAndReason._2();
    double updated_ScoreValue = Math.min(ScoreValue, new_ScoreValue);
    String updated_ScoreReason = new_ScoreValue < ScoreValue ? new_ScoreReason : ScoreReason;
    return new Tuple2<>(updated_ScoreValue, updated_ScoreReason);
};

Encoder<Tuple2<Double, String>> dsTEncoder = Encoders.tuple(Encoders.DOUBLE(), Encoders.STRING());

BiFunction<Column, Column, Column> calculateScore = withEncoders(calculateScoreRaw, "calculateScore",
        Encoders.INT(), dsTEncoder, dsTEncoder,
        DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create(1, 2, 3, 4),
        RowFactory.create(2, 1, 4, 3),
        RowFactory.create(3, 4, 9, 2),
        RowFactory.create(4, 3, 2, 1)
), new StructType(new StructField[]{
        new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
}));

// Capture the original columns before adding ScoreValueAndReason
String[] columnNames = df.columns();

df = df.withColumn("ScoreValueAndReason",
        functions.struct(functions.lit(1.0), functions.lit("No issues"))
);
df.show();

// No SQL registration is needed - the helper returns a Column-building function
//spark.udf().register("calculateScore", calculateScore, );
for (String columnName : columnNames) {
    df = df.withColumn(columnName, calculateScore.apply(col(columnName), col("ScoreValueAndReason")));
}
df.show();
produces:
+-----------------+-----------------+-----------------+-----------------+-------------------+
| A| B| C| D|ScoreValueAndReason|
+-----------------+-----------------+-----------------+-----------------+-------------------+
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
+-----------------+-----------------+-----------------+-----------------+-------------------+
So a few bits have moved around from your original source; crucially, columnNames has to be captured before ScoreValueAndReason is added, otherwise the loop would incorrectly include ScoreValueAndReason as well.
The creation of the ScalaUDF maker:
BiFunction<Column, Column, Column> calculateScore = withEncoders(calculateScoreRaw, "calculateScore",
        Encoders.INT(), dsTEncoder, dsTEncoder,
        DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));
passes in the original UDF you defined, together with the correct encoders and the return type. You could also make this callable from SQL, but that's an answer to a different question.