How to throw casting exception in Spark Dataset
Question
I'm loading a CSV file via Spark (Java):
Dataset<Row> dataset = sparkSession.read().option("header", "true").csv("/test.csv");
This is the schema of the file:
dataset.printSchema();
root
|-- eid: string (nullable = true)
|-- name: string (nullable = true)
|-- salary: string (nullable = true)
|-- designation: string (nullable = true)
This is the sample data:
dataset.show();
+-----+------------------+------------+-----------+
| eid| name| salary|designation|
+-----+------------------+------------+-----------+
| 1| "John"| "10000"| "SE"|
| 2| "Dan"| "100000"| "SE"|
| 3| "ironman"| "10000000"| "King"|
| 4| "Batman"| "100000000"| "Fighter"|
|awqwq| "captain america"| "300000"| "Captain"|
+-----+------------------+------------+-----------+
Casting eid to integer type:
dataset = dataset.withColumn("eid", dataset.col("eid").cast(DataTypes.IntegerType));
dataset.show();
+----+------------------+------------+-----------+
| eid| name| salary|designation|
+----+------------------+------------+-----------+
| 1| "John"| "10000"| "SE"|
| 2| "Dan"| "100000"| "SE"|
| 3| "ironman"| "10000000"| "King"|
| 4| "Batman"| "100000000"| "Fighter"|
|null| "captain america"| "300000"| "Captain"|
+----+------------------+------------+-----------+
But after the cast, the non-numeric (string) values in the eid column become null; no casting exception is thrown.
Is there any way an exception can be thrown? I have a huge number of columns, and throwing an exception is required.
Answer 1
Score: 2
Probably the most fluent way is to avoid casting and instead use FAILFAST read mode with a predefined schema:
spark.read
.schema("eid INT, name STRING, salary STRING, designation STRING")
.option("mode", "FAILFAST")
.option("header", true)
.csv("/test.csv")
.show()
This throws:
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
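Since the question uses Java, a rough Java equivalent of the same read might look like this (a sketch; the variable name strict is my own choice, and it reuses the sparkSession and /test.csv path from the question):
Dataset<Row> strict = sparkSession.read()
    .schema("eid INT, name STRING, salary STRING, designation STRING")
    .option("mode", "FAILFAST")
    .option("header", "true")
    .csv("/test.csv");
strict.show(); // the action fails with the SparkException above instead of producing nulls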
If this approach can't be used for some reason, arbitrary casting and other operations can be done using the Dataset API. This example is in Scala, but it could be written in Java as well:
spark.read
.option("header", true)
.csv("/test.csv")
.as[(String, String, String, String)]
.map {
case (eid, name, salary, designation) => (eid.toInt, name, salary, designation)
}
.show()
This throws:
java.lang.NumberFormatException: For input string: "awqwq"
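For reference, a Java sketch of the same map-based idea could look roughly like this (my own translation of the Scala snippet, not from the answer; it maps each row to a Tuple4 with Encoders.tuple and assumes the dataset variable from the question):
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import scala.Tuple4;

// Parse eid eagerly so a non-numeric value fails the job instead of turning into null.
Dataset<Tuple4<Integer, String, String, String>> typed = dataset.map(
    (MapFunction<Row, Tuple4<Integer, String, String, String>>) row ->
        new Tuple4<>(
            Integer.parseInt(row.getString(0)), // NumberFormatException for "awqwq"
            row.getString(1),
            row.getString(2),
            row.getString(3)),
    Encoders.tuple(Encoders.INT(), Encoders.STRING(), Encoders.STRING(), Encoders.STRING()));
typed.toDF("eid", "name", "salary", "designation").show();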
Alternatively, a UDF could be used as well.
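For instance, a minimal Java sketch of the UDF route could look like this (the UDF name strictToInt is just an illustrative choice):
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that parses eid and fails loudly on non-numeric input.
sparkSession.udf().register("strictToInt",
    (UDF1<String, Integer>) Integer::parseInt, // NumberFormatException for "awqwq"
    DataTypes.IntegerType);

dataset = dataset.withColumn("eid", callUDF("strictToInt", col("eid")));
dataset.show(); // the action triggers the exception if any value cannot be parsed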