Spark 2.3.1 => 2.4 increases runtime 6-fold

Question

I'm being forced onto a newer EMR version (5.23.1, 5.27.1, or 5.32+) by our cloud team, which moves me from 5.17.0 with Spark 2.3.1 to Spark 2.4.x. The impetus is to allow a security configuration that forbids Instance Metadata Service Version 1 (although I've tested without any security config attached, and I've also tested 5.20.1, which has no option for a security config and also runs Spark 2.4.0).

The runtime of a simple ETL job increases 6x on Spark 2.4 (compared to 2.3.1) with no code changes except the Spark version. There are leftOuterJoins on big RDDs in 3 of the 4 stages with the biggest slowdowns.
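
Roughly, the joins in those stages are plain pair-RDD leftOuterJoins along these lines (the names and record types below are illustrative only, not the real code):

import org.apache.spark.rdd.RDD

// Illustrative shapes only; the real key/value types differ.
case class Record(siteId: String, payload: String)
case class SiteMeta(siteId: String, region: String)

// given records: RDD[Record] and siteMeta: RDD[SiteMeta]
val bySite: RDD[(String, Record)] = records.keyBy(_.siteId)
val metaBySite: RDD[(String, SiteMeta)] = siteMeta.keyBy(_.siteId)

// leftOuterJoin keeps every left-side record; missing meta comes back as None
val joined: RDD[(String, (Record, Option[SiteMeta]))] = bySite.leftOuterJoin(metaBySite)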

I get no errors, just a 6x increase in time/cost. All code is compiled w/ Java 8.

EDIT

Confusingly, here is one snippet of the offending code where I can reproduce the problem in spark-shell, yet it does very little in the test run (because the if condition evaluates to false). No joins, no pulling data off disk... it just takes an existing RDD that's already been materialized, gives it a new name, and persists it to disk. I persist other RDDs to disk with no problem. On EMR 5.17 this snippet takes 4.6 minutes, and on 5.23.1 it takes 20 minutes.

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val rddWSiteB: RDD[StagedFormat] = {
  if (false) {          // <-- evaluates some stuff that's false
    val site5gnsaLookup = new Site5gnsaLookup(spark, req)
    site5gnsaLookup.rddWithSiteMeta(rddWSite)
  }
  else {
    rddWSite // <-- this is all that's happening; literally nothing
  }
}

rddWSiteB.setName("GetExternalMeta.rddWSiteB")

// THIS is the problem
// WHY does serializing/persisting to disk take 10x as long
//   in 2.4 vs 2.3?
rddWSiteB.persist(StorageLevel.DISK_ONLY)
rddWSiteB.count
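
For reference, the timings above are just the cost of materializing the DISK_ONLY cache, measured in spark-shell roughly like this (a minimal sketch; spark.time simply prints the elapsed time around the wrapped action):

// Isolate the cost of writing the DISK_ONLY copy to local disk.
rddWSiteB.unpersist(blocking = true)       // clear any earlier cached copy
rddWSiteB.persist(StorageLevel.DISK_ONLY)
spark.time { rddWSiteB.count() }           // ~4.6 min on EMR 5.17, ~20 min on 5.23.1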

END EDIT

I've read the Cloudera 2.3 => 2.4 migration guide and nothing seems relevant. In everything else I can find from Databricks and blog posts, most of the changes affect SQL and DataFrames, but I read JSON and CSV text straight into RDDs.
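
To be concrete about "straight into RDDs": ingestion bypasses the DataFrame/SQL readers entirely and looks roughly like this (the paths and the split-based parsing are placeholders, not the real job):

import org.apache.spark.rdd.RDD

// Plain text lines; no spark.read.json / spark.read.csv involved,
// so SQL-side behavior changes between 2.3 and 2.4 shouldn't apply.
val rawJson: RDD[String] = sc.textFile("s3://<bucket>/input/json/")            // placeholder path
val rawCsv: RDD[Array[String]] =
  sc.textFile("s3://<bucket>/input/csv/").map(_.split(",", -1))                // placeholder path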

I'm at a loss. With no errors I don't really know where to start, and I can't imagine a legitimate reason for a 6x increase in runtime. I'm not sure what to do next or what's going on. Any ideas for troubleshooting?

Lastly, I don't think my config is the problem, but in the absence of anything more directly useful, here is the config I use.

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "false",
      "spark.executor.instances": "284",
      "spark.executor.memory": "35977M",
      "spark.executor.memoryOverhead": "4497M",
      "spark.executor.cores": "5",
      "spark.driver.memory": "51199M",
      "spark.driver.memoryOverhead": "5119M",
      "spark.driver.cores": "15",
      "spark.default.parallelism": "4245",
      "spark.shuffle.compress": "true",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.driver.maxResultSize": "0",
      "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
      "spark.network.timeout": "600s",
      "spark.rpc.message.maxSize": "512",
      "spark.scheduler.listenerbus.eventqueue.capacity": "100000",
      "spark.kryoserializer.buffer.max": "256m"
    }
  },
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.endpoint": "s3.amazonaws.com"
    }
  },
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true",
      "fs.s3.maxRetries": "20"
    }
  },
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
           <Env Variables Removed>
        }
      }
    ]
  },
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "INFO, console",
      "log4j.logger.com.tmobile": "DEBUG",
      "log4j.appender.console.target": "System.err",
      "log4j.appender.console": "org.apache.log4j.ConsoleAppender",
      "log4j.appender.console.layout": "org.apache.log4j.EnhancedPatternLayout",
      "log4j.appender.console.layout.ConversionPattern": "%d{yyyy/MM/dd HH:mm:ss} [%10.10t] %-5p %-30.30c: %m%n",
      "log4j.logger.com.amazonaws.latency": "WARN",
      "log4j.logger.org": "WARN"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]

Answer 1

Score: 0

This turned out to be a matter of persisting with StorageLevel.DISK_ONLY while using EBS storage. The time to cache increased about 10x with the move from EMR 5.17 to 5.36.1 (Spark 2.3.1 to 2.4.8) on memory-optimized EC2 instances (r5.24xlarge) backed by EBS.

My solution was to move to storage-optimized instances (i4i.32xlarge), which have built-in SSD arrays with much faster write speeds.

My jobs run faster now thanks to the faster caching... but the instances are about 2x more expensive, so overall my cost is up about 40%.
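
For anyone chasing a similar regression, a quick sanity check is to A/B the storage level before blaming the computation itself — a minimal sketch, not something from the original job:

import org.apache.spark.storage.StorageLevel

// If MEMORY_ONLY_SER materializes quickly while DISK_ONLY is slow,
// the time is going into writing blocks to the executors' local
// (here EBS-backed) disks, not into computing or serializing the RDD.
rddWSiteB.unpersist(blocking = true)
rddWSiteB.persist(StorageLevel.MEMORY_ONLY_SER)
spark.time { rddWSiteB.count() }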
