2023年4月1日 01:03:29go评论81阅读模式

英文:

Memory issues running spark locally in Intellij (scala)

问题

I'm very new to Scala and Spark. I've been trying to accomplish a script that reads several of the same format excel files (separated by year: e.g. 2011.xlsx, 2012.xlsx, etc) into one dataframe. The total amount of data to be read into the dataframe is a peace-meal 350mb. Each file is approximately 30mb and there are roughly 12 files. However, I keep running into java.lang.OutOfMemoryErrors like below:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Executor task launch worker for task 0.0 in stage 0.0 (TID 0)"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "executor-kill-mark-cleanup"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Executor task launch worker for task 8.0 in stage 0.0 (TID 8)"
java.lang.OutOfMemoryError: Java heap space

I am running this code locally using Intellij IDEA:

import com.crealytics.spark.excel._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.{DataFrame, SparkSession, types}
import java.io File
object sparkJob extends App {
  val session = SparkSession.builder().
    config("spark.driver.bindAddress", "127.0.0.1").
    config("spark.executor.memory", "8g").
    config("spark.driver.memory", "8g").
    config("spark.memory.offHeap.enabled", true).
    config("spark.memory.offHeap.size", "4g").
    master("local[*]").
    appName("etl").
    getOrCreate()
  val dataSchema = types.StructType(Array(
    StructField("Delivery Date", types.StringType, nullable = false),
    StructField("Delivery Hour", types.IntegerType, nullable = false),
    StructField("Delivery Interval", types.IntegerType, nullable = false),
    StructField("Repeated Hour Flag", types.StringType, nullable = false),
    StructField("Settlement Point Name", types.StringType, nullable = false),
    StructField("Settlement Point Type", types.StringType, nullable = false),
    StructField("Settlement Point Price", types.DecimalType(10, 0), nullable = false)
  ))
  val dir = new File("data/")
  val files = dir.listFiles.map(_.getPath).toList
  def read_excel(filePath: String): DataFrame = {
    session.read.excel(header=true). 
      schema(dataSchema).
      load(filePath)
  }
  val df = files.map(f => read_excel(f))
  val mdf = df.reduce(_.union(_))
  mdf.show(5)
}

Things I've tried:

VM Options: -Xmx -Xms, and expanding various memory types inside the code's spark session config. My machine has 32gb of RAM, so that isn't an issue.

英文:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread &quot;RemoteBlock-temp-file-clean-thread&quot;
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread &quot;Spark Context Cleaner&quot;
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread &quot;Executor task launch worker for task 0.0 in stage 0.0 (TID 0)&quot;
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread &quot;executor-kill-mark-cleanup&quot;
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread &quot;Executor task launch worker for task 8.0 in stage 0.0 (TID 8)&quot;
java.lang.OutOfMemoryError: Java heap space

I am running this code locally using IntellijIDEA:

import com.crealytics.spark.excel._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.{DataFrame, SparkSession, types}
import java.io.File
object sparkJob extends App {
  val session = SparkSession.builder().
    config(&quot;spark.driver.bindAddress&quot;, &quot;127.0.0.1&quot;).
    config(&quot;spark.executor.memory&quot;, &quot;8g&quot;).
    config(&quot;spark.driver.memory&quot;, &quot;8g&quot;).
    config(&quot;spark.memory.offHeap.enabled&quot;, true).
    config(&quot;spark.memory.offHeap.size&quot;, &quot;4g&quot;).
    master(&quot;local[*]&quot;).
    appName(&quot;etl&quot;).
    getOrCreate()
  val dataSchema = types.StructType(Array(
    StructField(&quot;Delivery Date&quot;, types.StringType, nullable = false),
    StructField(&quot;Delivery Hour&quot;, types.IntegerType, nullable = false),
    StructField(&quot;Delivery Interval&quot;, types.IntegerType, nullable = false),
    StructField(&quot;Repeated Hour Flag&quot;, types.StringType, nullable = false),
    StructField(&quot;Settlement Point Name&quot;, types.StringType, nullable = false),
    StructField(&quot;Settlement Point Type&quot;, types.StringType, nullable = false),
    StructField(&quot;Settlement Point Price&quot;, types.DecimalType(10, 0), nullable = false)
  ))
  val dir = new File(&quot;data/&quot;)
  val files = dir.listFiles.map(_.getPath).toList
  def read_excel(filePath: String): DataFrame = {
    session.read.excel(header=true). 
      schema(dataSchema).
      load(filePath)
  }
  val df = files.map(f =&gt; read_excel(f))
  val mdf = df.reduce(_.union(_))
  mdf.show(5)
}

Things I've tried:

VM Options: -Xmx -Xms, and expanding various memory types inside the code's spark session config. My machine has 32gb of RAM, so that isn't an issue.

答案1

得分: 0

Use parallelize instead of map to read files in parallel. This way Spark will distribute jobs among cluster nodes and use parallel processing to improve performance. For example, you can create an RDD from the list of files and then use map on the RDD:

使用parallelize而不是map来并行读取文件。这样，Spark将在集群节点之间分配任务并使用并行处理以提高性能。例如，您可以从文件列表创建一个RDD，然后在RDD上使用map：

val filesRDD = session.sparkContext.parallelize(files)
val df = filesRDD.map(f => read_excel(f))

Use cache to store the DataFrame. This way, the data will be cached and will not have to be read from disk every time an action is performed on it:

使用缓存来存储DataFrame。这样，数据将被缓存，每次对其执行操作时都不必从磁盘上读取：

val mdf = df.reduce(_.union(_)).cache()

the last attempt you can try to do is to set: spark.executor.memory=12g, but I think it is an extreme solution, it might be interesting to debug the excel decoding library to see if the high memory usage is given by it.

最后尝试的一步是设置：spark.executor.memory=12g，但我认为这是一个极端的解决方案，可能会有兴趣调试Excel解码库，以查看高内存使用是否由它引起。

英文:

val filesRDD = session.sparkContext.parallelize(files)
val df = filesRDD.map(f =&gt; read_excel(f))

Use cache to store the DataFrame. This way, the data will be cached and will not have to be read from disk every time an action is performed on it:

val mdf = df.reduce(_.union(_)).cache()

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Memory issues running spark locally in Intellij (scala)

问题

答案1

从Kafka主题读取消息并将其转储到HDFS。

Spark Scala – 将查询结果存储为 Scala 整数类型

如何使用Scala的sttp客户端下载文件。

为什么 Spark 模式的 .simpleString() 方法会截断我的输出？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。