How to throw a casting exception in a Spark Dataset


Question

I'm loading a CSV file via Spark (Java):

Dataset<Row> dataset = sparkSession.read().option("header", "true").csv("/test.csv");

This is the schema of the file:

dataset.printSchema();
root
    |-- eid: string (nullable = true)
    |-- name: string (nullable = true)
    |-- salary: string (nullable = true)
    |-- designation: string (nullable = true)

This is the sample data:

dataset.show();

+-----+------------------+------------+-----------+
|  eid|              name|      salary|designation|
+-----+------------------+------------+-----------+
|    1|            "John"|     "10000"|       "SE"|
|    2|             "Dan"|    "100000"|       "SE"|
|    3|         "ironman"|  "10000000"|     "King"|
|    4|          "Batman"| "100000000"|  "Fighter"|
|awqwq| "captain america"|    "300000"|  "Captain"|
+-----+------------------+------------+-----------+

Casting to integer type:

dataset = dataset.withColumn("eid", dataset.col("eid").cast(DataTypes.IntegerType));
dataset.show();

+----+------------------+------------+-----------+
| eid|              name|      salary|designation|
+----+------------------+------------+-----------+
|   1|            "John"|     "10000"|       "SE"|
|   2|             "Dan"|    "100000"|       "SE"|
|   3|         "ironman"|  "10000000"|     "King"|
|   4|          "Batman"| "100000000"|  "Fighter"|
|null| "captain america"|    "300000"|  "Captain"|
+----+------------------+------------+-----------+

But after the cast, unparseable values in the eid column become null (Spark's cast silently returns null for strings it can't parse). It's not throwing any casting exception.

Is there any way an exception can be thrown instead? I have a huge number of columns, and throwing exceptions is required.

Answer 1

Score: 2

Probably the most fluent way is to avoid casting and instead use FAILFAST read mode with a pre-defined schema:

spark.read
  .schema("eid INT, name STRING, salary STRING, designation STRING")
  .option("mode", "FAILFAST")
  .option("header", true)
  .csv("/test.csv")
  .show()

This throws:

org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
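
Since the question uses Java, here is a minimal Java sketch of the same FAILFAST read. It assumes an existing SparkSession named sparkSession, as in the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// FAILFAST makes Spark throw on the first malformed record
// instead of silently turning it into null.
Dataset<Row> ds = sparkSession.read()
        .schema("eid INT, name STRING, salary STRING, designation STRING")
        .option("mode", "FAILFAST")
        .option("header", "true")
        .csv("/test.csv");
ds.show(); // the read is lazy; the SparkException surfaces here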

If this approach can't be used for some reason, arbitrary casting and other operations can be done with the Dataset API. This example is in Scala, but it could be written in Java as well (a Java sketch follows below):

spark.read
  .option("header", true)
  .csv("/test.csv")
  .as[(String, String, String, String)]
  .map {
    case (eid, name, salary, designation) => (eid.toInt, name, salary, designation)
  }
  .show()

This throws:

java.lang.NumberFormatException: For input string: "awqwq"
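
A hedged Java counterpart of that Scala snippet, using a typed map so that Integer.parseInt fails loudly on the bad row (the variable names are mine, and Spark wraps the NumberFormatException in a SparkException when the job actually runs):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Parse eid eagerly; unlike cast(), parseInt throws on bad input.
Dataset<Integer> eids = sparkSession.read()
        .option("header", "true")
        .csv("/test.csv")
        .map((MapFunction<Row, Integer>) row -> Integer.parseInt(row.getString(0)),
             Encoders.INT());
eids.show(); // fails with NumberFormatException: For input string: "awqwq"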

Alternatively, a UDF could be used as well.
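
As a sketch of that UDF route (the name strictInt is made up for illustration): register a strict parser and apply it to each column that needs checking. Since the asker has many columns, the same call can be repeated in a loop over the column names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Unlike cast(), Integer.parseInt throws instead of returning null.
sparkSession.udf().register("strictInt",
        (UDF1<String, Integer>) s -> s == null ? null : Integer.parseInt(s),
        DataTypes.IntegerType);

Dataset<Row> checked = dataset.withColumn("eid", callUDF("strictInt", col("eid")));
checked.show(); // the NumberFormatException surfaces (wrapped by Spark) at job execution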


huangapple • Published 2020-07-27 15:34:02 • Original link: https://go.coder-hub.com/63110678.html