An exception occurs when Spark converts a JSON string to a HashMap


Question

There is no problem in the local environment, but an exception occurs when performing spark-submit.

The approximate code is as follows:

class Test extends Serializable {
    def action() = {
        val sc = SparkContext.getOrCreate(sparkConf)
        val rdd1 = sc.textFile(.. )
        val rdd2 = rdd1.map(logLine => {
            // gson
            val jsonObject = jsonParser.parse(logLine).getAsJsonObject
            // jackson
            val parsedJson = objectMapper.readValue(logLine, classOf[HashMap[String, String]])
            MyDataSet(parsedJson.get("field1"), parsedJson.get("field2"), ...)
        })
    }
}

Exception

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
ensureSerializable(ClosureCleaner.scala:444)..
........
........
caused by: java.io.NotSerializableException : com.fasterxml.jackson.module.scala.modifiers.ScalaTypeModifier


I have used both the gson and jackson libraries.

Isn't this a problem that can be solved just by extending Serializable?

Answer 1

Score: 1

The NotSerializableException is pretty self-explanatory: your task is not serializable. Spark is a parallel computing engine. The driver (where your main program is executed) ships the transformations you want to apply to the RDD (the code written inside map functions) to the executors, where they are executed. Therefore those transformations need to be serializable. In your case, jsonParser and objectMapper are created on the driver. To use them inside a transformation, Spark tries to serialize them and fails because they are not serializable. That is your error. I don't know which one is not serializable; maybe both.

Let's take an example and see how we can fix it.

// let's create a non-serializable class
class Stuff(val i: Int) {
    def get() = i
}
// we instantiate it in the driver
val stuff = new Stuff(4)

// this fails with "Caused by: java.io.NotSerializableException: Stuff"
val result = sc.parallelize(Seq(1, 2, 3)).map(x => (x, stuff.get)).collect

To fix it, let's create the object inside the transformation:

val result = sc.parallelize(Seq(1, 2, 3))
    .map(x => {
        val new_stuff = new Stuff(4)
        (x, new_stuff.get)
    }).collect

It works, but obviously creating the object for every record can be quite expensive. We can do better with mapPartitions and create the object only once per partition:

val result = sc.parallelize(Seq(1, 2, 3))
    .mapPartitions(part => {
         val new_stuff = new Stuff(4)
         part.map(x => (x, new_stuff.get))
    }).collect
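
Applied to the code in the question, the same pattern would look roughly like the sketch below. This is a minimal, untested sketch, not the answerer's code: it assumes rdd1 is the RDD[String] from the question, MyDataSet is a case class with string fields, and jackson-module-scala is on the classpath. The ObjectMapper is built once per partition on the executor, so nothing non-serializable has to be shipped from the driver.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val rdd2 = rdd1.mapPartitions(part => {
    // built on the executor, once per partition, so it is never serialized
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    part.map(logLine => {
        val parsedJson = mapper.readValue(logLine, classOf[java.util.HashMap[String, String]])
        MyDataSet(parsedJson.get("field1"), parsedJson.get("field2"))
    })
})

An alternative that avoids per-partition construction entirely is to hold the mapper in a Scala object: a singleton is initialized lazily in each executor JVM rather than captured and serialized with the closure. A sketch, under the same assumptions:

object JsonMapper {
    // initialized lazily on each JVM that uses it; never shipped over the wire
    lazy val mapper: ObjectMapper = {
        val m = new ObjectMapper()
        m.registerModule(DefaultScalaModule)
        m
    }
}

Inside the map closure you would then call JsonMapper.mapper.readValue(...) directly.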