How to throw a casting exception in a Spark Dataset


Question

I'm loading a CSV file via Spark (Java):

Dataset<Row> dataset = sparkSession.read().option("header", "true").csv("/test.csv");

This is the schema of the file:

dataset.printSchema();
root
    |-- eid: string (nullable = true)
    |-- name: string (nullable = true)
    |-- salary: string (nullable = true)
    |-- designation: string (nullable = true)

This is the sample data:

dataset.show();

+-----+------------------+------------+-----------+
|  eid|              name|      salary|designation|
+-----+------------------+------------+-----------+
|    1|            "John"|     "10000"|       "SE"|
|    2|             "Dan"|    "100000"|       "SE"|
|    3|         "ironman"|  "10000000"|     "King"|
|    4|          "Batman"| "100000000"|  "Fighter"|
|awqwq| "captain america"|    "300000"|  "Captain"|
+-----+------------------+------------+-----------+

Casting to integer type:

dataset = dataset.withColumn("eid", dataset.col("eid").cast(DataTypes.IntegerType));
dataset.show();

+----+------------------+------------+-----------+
| eid|              name|      salary|designation|
+----+------------------+------------+-----------+
|   1|            "John"|     "10000"|       "SE"|
|   2|             "Dan"|    "100000"|       "SE"|
|   3|         "ironman"|  "10000000"|     "King"|
|   4|          "Batman"| "100000000"|  "Fighter"|
|null| "captain america"|    "300000"|  "Captain"|
+----+------------------+------------+-----------+

But after the cast, unparseable values in the eid column become null (Spark's cast silently returns null for strings it can't parse). It's not throwing any casting exception.

Is there any way an exception can be thrown instead? I have a huge number of columns, and throwing exceptions is required.

Answer 1

Score: 2

Probably the most fluent way is to avoid casting and instead use FAILFAST read mode with a pre-defined schema:

spark.read
  .schema("eid INT, name STRING, salary STRING, designation STRING")
  .option("mode", "FAILFAST")
  .option("header", true)
  .csv("/test.csv")
  .show()

This throws:

org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
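
Since the question uses Java, here is a minimal Java sketch of the same FAILFAST read. It assumes an existing SparkSession named sparkSession, as in the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// FAILFAST makes Spark throw on the first malformed record
// instead of silently turning it into null.
Dataset<Row> ds = sparkSession.read()
        .schema("eid INT, name STRING, salary STRING, designation STRING")
        .option("mode", "FAILFAST")
        .option("header", "true")
        .csv("/test.csv");
ds.show(); // the read is lazy; the SparkException surfaces here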

If this approach can't be used for some reason, arbitrary casting and other operations can be done with the Dataset API. This example is in Scala, but it could be written in Java as well (a Java sketch follows below):

spark.read
  .option("header", true)
  .csv("/test.csv")
  .as[(String, String, String, String)]
  .map {
    case (eid, name, salary, designation) => (eid.toInt, name, salary, designation)
  }
  .show()

This throws:

java.lang.NumberFormatException: For input string: "awqwq"
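
A hedged Java counterpart of that Scala snippet, using a typed map so that Integer.parseInt fails loudly on the bad row (the variable names are mine, and Spark wraps the NumberFormatException in a SparkException when the job actually runs):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Parse eid eagerly; unlike cast(), parseInt throws on bad input.
Dataset<Integer> eids = sparkSession.read()
        .option("header", "true")
        .csv("/test.csv")
        .map((MapFunction<Row, Integer>) row -> Integer.parseInt(row.getString(0)),
             Encoders.INT());
eids.show(); // fails with NumberFormatException: For input string: "awqwq"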

Alternatively, a UDF could be used as well.
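
As a sketch of that UDF route (the name strictInt is made up for illustration): register a strict parser and apply it to each column that needs checking. Since the asker has many columns, the same call can be repeated in a loop over the column names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Unlike cast(), Integer.parseInt throws instead of returning null.
sparkSession.udf().register("strictInt",
        (UDF1<String, Integer>) s -> s == null ? null : Integer.parseInt(s),
        DataTypes.IntegerType);

Dataset<Row> checked = dataset.withColumn("eid", callUDF("strictInt", col("eid")));
checked.show(); // the NumberFormatException surfaces (wrapped by Spark) at job execution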


huangapple • Published 2020-07-27 15:34:02 • Original link: https://go.coder-hub.com/63110678.html