
Spark 3.2 driver GC while reading JSON file, same works in spark 2.3

Question

I am running a simple file-loading script, shown below:

import org.apache.spark.sql.DataFrame

val files = fs.globStatus(path).filterNot(_.isDirectory) // 50 files, each ~100 MB
val fileMap = scala.collection.mutable.Map.empty[String, DataFrame]
files.par.foreach { f =>
  val filename = f.getPath.getName // getName already returns a String
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  fileMap.update(filename, df) // note: mutable.Map is not thread-safe under .par
}

When run with a 4 GB driver on Spark 3.2, the job fails with a GC overhead limit exceeded error, while the same job on Spark 2.3 completes without any error. A JVM heap dump shows around 2.3 GB of usage under Spark 2.3, whereas under Spark 3.2 usage grows beyond 4 GB and the driver crashes.

All parameters remain the same for both scenarios.

The JSON files contain nested structures and have around 800 columns.

In the above code I am not calling any action; map.update should only store a reference to each DataFrame. Why is the driver consuming memory even though no action is called, and why do the two versions behave so differently?
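One detail worth noting here: even without an explicit action, spark.read.json with no schema eagerly triggers a Spark job that scans the input to infer the schema, so each of the 50 reads does real work and materializes a large inferred schema (here, ~800 nested columns) on the driver. A minimal sketch of avoiding the per-file inference, assuming the first file is representative of the rest (the sampleSchema value is illustrative, not part of the original code):

import org.apache.spark.sql.types.StructType

// Run schema inference once, on a single sample file...
val sampleSchema: StructType = spark.read
  .option("mode", "DROPMALFORMED")
  .json(files.head.getPath.toString)
  .schema

// ...then reuse that schema so the remaining reads stay fully lazy
// instead of each launching their own inference job.
val df = spark.read
  .option("mode", "DROPMALFORMED")
  .schema(sampleSchema)
  .json(f.getPath.toString)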


Answer 1

Score: 0

On taking a GC dump of the driver, I found that it was actually the fileMap and the parallel iterable objects that were occupying more than 50 GB of memory space.

I fixed the issue by removing the fileMap object and switching files.par to a Java thread pool as its task support. Details are in my Medium blog post.
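For readers, a minimal sketch of that approach, assuming Scala 2.12 (where parallel collections and ForkJoinTaskSupport ship with the standard library); the pool size of 4 is an arbitrary choice, and this is an illustration rather than the exact code from the blog:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap the parallelism of the parallel collection with an explicit pool
// instead of the default pool, which sizes itself to the available cores.
val parFiles = files.par
parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

parFiles.foreach { f =>
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  // process df directly here (e.g. write it out) instead of keeping a
  // reference to every DataFrame in a shared driver-side map
}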
