Spark 3.2 driver GC while reading JSON files, same works in Spark 2.3
Question
I am running a simple file-load script with the code below:
// 50 files of ~100 MB each; keep only regular files, not directories
val files = fs.globStatus(path).filter(!_.isDirectory)
val fileMap = scala.collection.mutable.Map.empty[String, DataFrame]

files.par.foreach { f =>
  val filename = f.getPath.getName
  // No explicit schema, so each read infers the schema from the file
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  fileMap.update(filename, df)
}
When run with a 4 GB driver on Spark 3.2, the job fails with a GC overhead limit exceeded error. The same job on Spark 2.3 completes without any error: there the JVM heap dump shows around 2.3 GB of usage, while on Spark 3.2 it grows beyond 4 GB and crashes.
All parameters are the same in both scenarios.
The JSON files have a nested structure with around 800 columns.
In the code above I am not calling any action, and map.update should only store a reference to the DataFrame. Why is the driver consuming memory although no action is called, and why the difference in behavior between the two versions?
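For reference, I am not supplying an explicit schema, so each json() call has to infer the schema from the file itself. A sketch of what an explicit-schema read would look like instead (the schema below is a made-up placeholder, not the real ~800-column one):

import org.apache.spark.sql.types._

// Placeholder schema for illustration only; the real files have ~800 nested columns
val knownSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StructType(Seq(
    StructField("value", StringType)
  )))
))

val df = spark.read
  .option("mode", "DROPMALFORMED")
  .schema(knownSchema) // with an explicit schema, json() does not run a schema-inference job
  .json(f.getPath.toString)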
Answer 1
Score: 0
On taking a GC dump of the driver, I found that it was actually the fileMap and the parallel-iterable objects that were occupying more than 50 GB of memory.
I fixed the issue by removing the fileMap object and switching to a Java thread pool as the task support for files.par. Details are in my Medium blog.
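Roughly, the task-support change looks like the sketch below; the pool type and size here are illustrative (a fixed ForkJoinPool of 8 threads), not necessarily what the blog uses, and the per-file processing is left as a placeholder:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Give the parallel collection an explicit, bounded pool instead of the default one
val pool = new ForkJoinPool(8)
val parFiles = files.par
parFiles.tasksupport = new ForkJoinTaskSupport(pool)

parFiles.foreach { f =>
  // Read and process each DataFrame here instead of keeping it in a shared mutable map
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  // ... process df ...
}

pool.shutdown()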