
Spark 3.2 driver GC while reading JSON file, same works in spark 2.3

Question

I am running a simple file-loading script, shown below:

import org.apache.spark.sql.DataFrame

val files = fs.globStatus(path).filterNot(_.isDirectory) // 50 files, each ~100 MB
val fileMap = scala.collection.mutable.Map.empty[String, DataFrame]
files.par.foreach { f =>
  val filename = f.getPath.getName // getName already returns a String
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  fileMap.update(filename, df) // note: mutable.Map is not thread-safe under .par
}

When run with a 4 GB driver on Spark 3.2, the job fails with a GC overhead limit exceeded error, while the same job on Spark 2.3 completes without any error. A JVM heap dump shows around 2.3 GB of usage under Spark 2.3, whereas under Spark 3.2 usage grows beyond 4 GB and the driver crashes.

All parameters remain the same for both scenarios.

The JSON files contain nested structures and have around 800 columns.

In the above code I am not calling any action; map.update should only store a reference to each DataFrame. Why is the driver consuming memory even though no action is called, and why do the two versions behave so differently?
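One detail worth noting here: even without an explicit action, spark.read.json with no schema eagerly triggers a Spark job that scans the input to infer the schema, so each of the 50 reads does real work and materializes a large inferred schema (here, ~800 nested columns) on the driver. A minimal sketch of avoiding the per-file inference, assuming the first file is representative of the rest (the sampleSchema value is illustrative, not part of the original code):

import org.apache.spark.sql.types.StructType

// Run schema inference once, on a single sample file...
val sampleSchema: StructType = spark.read
  .option("mode", "DROPMALFORMED")
  .json(files.head.getPath.toString)
  .schema

// ...then reuse that schema so the remaining reads stay fully lazy
// instead of each launching their own inference job.
val df = spark.read
  .option("mode", "DROPMALFORMED")
  .schema(sampleSchema)
  .json(f.getPath.toString)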


Answer 1

Score: 0

On taking a GC dump of the driver, I found that it was actually the fileMap and the parallel iterable objects that were occupying more than 50 GB of memory space.

I fixed the issue by removing the fileMap object and switching files.par to a Java thread pool as its task support. Details are in my Medium blog post.
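For readers, a minimal sketch of that approach, assuming Scala 2.12 (where parallel collections and ForkJoinTaskSupport ship with the standard library); the pool size of 4 is an arbitrary choice, and this is an illustration rather than the exact code from the blog:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap the parallelism of the parallel collection with an explicit pool
// instead of the default pool, which sizes itself to the available cores.
val parFiles = files.par
parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

parFiles.foreach { f =>
  val df = spark.read.option("mode", "DROPMALFORMED").json(f.getPath.toString)
  // process df directly here (e.g. write it out) instead of keeping a
  // reference to every DataFrame in a shared driver-side map
}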
