How to create a Spark UDF that returns a Tuple or updates two columns at the same time?
Here's the modified code with the necessary changes to fix the error and achieve your expected resulting DataFrame:
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class UDFExample {

    public static void main(String[] args) {
        // Spark session
        SparkSession spark = SparkSession.builder()
                .appName("UDFExample")
                .config("spark.master", "local")
                .getOrCreate();

        // Define and register the UDF. Spark passes a struct column into a UDF as a Row,
        // not as scala.Tuple2, so the second parameter and the result are typed as Row.
        spark.udf().register("calculateScore", new UDF2<Integer, Row, Row>() {
            @Override
            public Row call(Integer feature, Row scoreValueAndReason) throws Exception {
                double newScoreValue = 0.5;
                String newScoreReason = "Null Issue";
                double scoreValue = scoreValueAndReason.getDouble(0);
                String scoreReason = scoreValueAndReason.getString(1);
                double updatedScoreValue = Math.min(scoreValue, newScoreValue);
                String updatedScoreReason = newScoreValue < scoreValue ? newScoreReason : scoreReason;
                return RowFactory.create(updatedScoreValue, updatedScoreReason);
            }
        }, DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));

        // Create DataFrame
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 2, 3, 4),
                RowFactory.create(2, 1, 4, 3),
                RowFactory.create(3, 4, 9, 2),
                RowFactory.create(4, 3, 2, 1)
        ), new StructType(new StructField[]{
                new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
        }));

        // Capture the original column names before adding ScoreValueAndReason,
        // so the loop below does not also try to rewrite the struct column itself.
        String[] columnNames = df.columns();

        df = df.withColumn("ScoreValueAndReason",
                functions.struct(functions.lit(1.0), functions.lit("No issues"))
        );
        df.show();

        // Apply the UDF to each original column
        for (String columnName : columnNames) {
            df = df.withColumn(columnName,
                    functions.callUDF("calculateScore", col(columnName), col("ScoreValueAndReason")));
        }
        df.show();
    }
}
I've made the following changes to your code:
- Registered the UDF inline with spark.udf().register(...), giving it an explicit struct return type.
- Declared the UDF's struct parameter and result as Row instead of scala.Tuple2: Spark hands struct columns to a UDF as a Row, which is what causes the "GenericRowWithSchema incompatible with scala.Tuple2" error. The fields are read with getDouble(0)/getString(1) and the result is built with RowFactory.create.
- Captured the column names before adding the ScoreValueAndReason column, so the loop only rewrites A, B, C and D.
- Applied the UDF to each of those columns inside the loop.
With these changes the code should run without the "org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema incompatible with scala.Tuple2" error and give you the desired resulting DataFrame with updated columns.
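As a possible follow-up (not part of the code above): if you later want to split one of these struct columns back into two flat columns, which is the "split into two in a subsequent step" mentioned in the question below, a minimal sketch using Column.getField could look like this. The column and field names are the ones used above; the A_scoreValue/A_scoreReason names are just illustrative:

// Split the struct produced for column "A" into two flat columns.
// getField("_1") / getField("_2") address the fields defined in the UDF's return schema.
df = df
        .withColumn("A_scoreValue", functions.col("A").getField("_1"))
        .withColumn("A_scoreReason", functions.col("A").getField("_2"))
        .drop("A");
df.show();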
Question
I'm trying to create a UDF in Java that iterates over all columns in the dataset, calculates a score for each column, and sets a flag with a specific string. I tried a UDF that updates two columns at the same time, but with no success. So my current approach is to have a single column composed of a tuple, which I plan to split into two columns in a subsequent step.
This is my attempt (example code):
public static void main(String[] args) {
    // Spark session
    SparkSession spark = SparkSession.builder()
            .appName("UDFExample")
            .config("spark.master", "local")
            .getOrCreate();

    UDF2<Integer, Tuple2<Double, String>, Tuple2<Double, String>> calculateScore = (feature, ScoreValueAndReason) -> {
        double new_ScoreValue = 0.5;
        String new_ScoreReason = "Null Issue";
        double ScoreValue = ScoreValueAndReason._1();
        String ScoreReason = ScoreValueAndReason._2();
        double updated_ScoreValue = Math.min(ScoreValue, new_ScoreValue);
        String updated_ScoreReason = new_ScoreValue < ScoreValue ? new_ScoreReason : ScoreReason;
        return new Tuple2<>(updated_ScoreValue, updated_ScoreReason);
    };

    Dataset<Row> df = spark.createDataFrame(Arrays.asList(
            RowFactory.create(1, 2, 3, 4),
            RowFactory.create(2, 1, 4, 3),
            RowFactory.create(3, 4, 9, 2),
            RowFactory.create(4, 3, 2, 1)
    ), new StructType(new StructField[]{
            new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
    }));

    df = df.withColumn("ScoreValueAndReason",
            functions.struct(functions.lit(1.0), functions.lit("No issues"))
    );
    df.show();

    // Register the UDF
    spark.udf().register("calculateScore", calculateScore, DataTypes.createStructType(new StructField[]{
            new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("_2", DataTypes.StringType, false, Metadata.empty())
    }));

    String[] columnNames = df.columns();
    for (String columnName : columnNames) {
        df = df.withColumn(columnName, functions.callUDF("calculateScore", col(columnName), col("ScoreValueAndReason")));
    }
    df.show();
}
I tried the implementation above as well as a slightly different UDF, but I keep running into this error: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema incompatible with scala.Tuple2
I expect a resulting DataFrame with either two new columns, scoreReason and scoreValue, or (more like in the example I provided) a single ScoreValueAndReason column that contains the values updated by my UDF.
Answer 1

Score: 0
See here.
Out of the box, UDFs don't support complex types such as Tuple2; a struct argument arrives as a Row, so you are forced to work with Row directly. For Scala there was https://github.com/gaborbarna/typedudf, and there is https://github.com/typelevel/frameless, which provide typed UDFs that derive and use encoders correctly so you don't have to handle Row yourself.
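To make the "forced to use Row" point concrete, here is a minimal sketch (an illustration added here, not code from the answer) of the plain-Row variant of the asker's UDF; the struct argument is typed as Row and read positionally:

// Hypothetical sketch: same logic as calculateScore, but the struct argument
// arrives as a Row, so it is read with getDouble(0)/getString(1),
// and the struct result is built with RowFactory.create.
UDF2<Integer, Row, Row> calculateScoreRow = (feature, scoreValueAndReason) -> {
    double scoreValue = scoreValueAndReason.getDouble(0);
    String scoreReason = scoreValueAndReason.getString(1);
    double newScoreValue = 0.5;
    return RowFactory.create(
            Math.min(scoreValue, newScoreValue),
            newScoreValue < scoreValue ? "Null Issue" : scoreReason);
};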
That said, if you can live without calling the UDF from SQL, you can define the following helper, which wires in encoders and works for any parameter types (tested on Spark 3.4.1; the ScalaUDF constructor may differ on other versions):
public <A, B, R> BiFunction<Column, Column, Column> withEncoders(UDF2<A, B, R> udf2, String name,
        Encoder<A> encA, Encoder<B> encB, Encoder<R> encR, DataType returnType) {
    DataType replaced = CharVarcharUtils.failIfHasCharVarchar(returnType);
    // Adapt the Java UDF2 to the scala.Function2 that ScalaUDF expects
    scala.Function2<A, B, R> func = new scala.runtime.AbstractFunction2<A, B, R>() {
        @Override
        public Object apply(Object v1, Object v2) {
            try {
                return ((UDF2<Object, Object, Object>) udf2).call(v1, v2);
            } catch (Exception e) {
                e.printStackTrace();
                return null;
            }
        }
    };
    // Each call wraps the two input Columns in a ScalaUDF expression that carries explicit
    // input/output encoders, so struct arguments arrive as Tuple2 rather than Row
    return (Column a, Column b) -> {
        ScalaUDF udf = new ScalaUDF(func, replaced,
                Seq$.MODULE$.<Expression>newBuilder().$plus$eq(a.expr()).$plus$eq(b.expr()).result(),
                Seq$.MODULE$.<scala.Option<ExpressionEncoder<?>>>newBuilder()
                        .$plus$eq(Some$.MODULE$.apply((ExpressionEncoder<A>) encA))
                        .$plus$eq(Some$.MODULE$.apply((ExpressionEncoder<B>) encB)).result(),
                Some$.MODULE$.apply((ExpressionEncoder<R>) encR), Some$.MODULE$.apply(name), true, true);
        return new Column(udf);
    };
}
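For reference, a best-effort list of the imports the helper above relies on (added here for convenience, not part of the original answer; in particular, the package of Seq$ depends on the Scala version of your Spark build):

import java.util.function.BiFunction;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.catalyst.expressions.ScalaUDF;
import org.apache.spark.sql.catalyst.util.CharVarcharUtils;
import org.apache.spark.sql.types.DataType;
import scala.Some$;
import scala.collection.Seq$; // on Scala 2.13 builds this may instead be scala.collection.immutable.Seq$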
Then:
UDF2<Integer, Tuple2<Double, String>, Tuple2<Double, String>> calculateScoreRaw = (feature, ScoreValueAndReason) -> {
    double new_ScoreValue = 0.5;
    String new_ScoreReason = "Null Issue";
    double ScoreValue = ScoreValueAndReason._1();
    String ScoreReason = ScoreValueAndReason._2();
    double updated_ScoreValue = Math.min(ScoreValue, new_ScoreValue);
    String updated_ScoreReason = new_ScoreValue < ScoreValue ? new_ScoreReason : ScoreReason;
    return new Tuple2<>(updated_ScoreValue, updated_ScoreReason);
};

Encoder<Tuple2<Double, String>> dsTEncoder = Encoders.tuple(Encoders.DOUBLE(), Encoders.STRING());

BiFunction<Column, Column, Column> calculateScore = withEncoders(calculateScoreRaw, "calculateScore",
        Encoders.INT(), dsTEncoder, dsTEncoder,
        DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create(1, 2, 3, 4),
        RowFactory.create(2, 1, 4, 3),
        RowFactory.create(3, 4, 9, 2),
        RowFactory.create(4, 3, 2, 1)
), new StructType(new StructField[]{
        new StructField("A", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("B", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("C", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("D", DataTypes.IntegerType, true, Metadata.empty())
}));

// Capture the original columns before adding ScoreValueAndReason
String[] columnNames = df.columns();

df = df.withColumn("ScoreValueAndReason",
        functions.struct(functions.lit(1.0), functions.lit("No issues"))
);
df.show();

// No SQL registration is needed - the helper returns a Column-building function
//spark.udf().register("calculateScore", calculateScore, );
for (String columnName : columnNames) {
    df = df.withColumn(columnName, calculateScore.apply(col(columnName), col("ScoreValueAndReason")));
}
df.show();
produces:
+-----------------+-----------------+-----------------+-----------------+-------------------+
| A| B| C| D|ScoreValueAndReason|
+-----------------+-----------------+-----------------+-----------------+-------------------+
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}|{0.5, Null Issue}| {1.0, No issues}|
+-----------------+-----------------+-----------------+-----------------+-------------------+
So a few bits have moved around from your original source; crucially, columnNames has to be captured before ScoreValueAndReason is added, otherwise the loop would incorrectly include ScoreValueAndReason as well.
The creation of the ScalaUDF maker:
BiFunction<Column, Column, Column> calculateScore = withEncoders(calculateScoreRaw, "calculateScore",
        Encoders.INT(), dsTEncoder, dsTEncoder,
        DataTypes.createStructType(new StructField[]{
                new StructField("_1", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("_2", DataTypes.StringType, false, Metadata.empty())
        }));
passes in the original UDF you defined, together with the correct encoders and the return type. You could also make this callable from SQL, but that's an answer to a different question.