Externalize Spark Configurations

Question

I need to externalize the Spark settings currently kept in our job.conf files, so that they are read from a single external location at runtime and only ever need to be changed in that one place.

Settings such as the following would be stored in that external file:

  • spark.executor.memory
  • spark.executor.cores
  • spark.executor.instances
  • spark.sql.adaptive.enabled
  • spark.sql.legacy.timeParserPolicy

I am very new to this and have found very few resources online about how to handle it. I've seen a couple of YouTube videos that use a Scala file for this. Any assistance would be greatly appreciated.

I have tried to emulate the Scala examples I found online, but I don't know how to call the resulting file from Spark, or even whether the Scala is correct to begin with.

Answer 1

Score: 1

TL;DR:

  • You can put your settings in $SPARK_HOME/conf/spark-defaults.conf.
  • Or, if you submit your jobs explicitly with spark-submit (or a similar launcher), you can pass the settings on the command line using --conf.
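
The second option can be sketched as a small launcher script. This is only an illustration, not part of Spark: the helper name build_spark_submit, the application file my_job.py, and the setting values are all made up for the example. It turns a dictionary of settings into the --conf key=value arguments that spark-submit accepts.

```python
import shlex
from typing import Dict, List


def build_spark_submit(app: str, settings: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per setting.

    Hypothetical helper for illustration; `app` is the job file to submit.
    """
    cmd = ["spark-submit"]
    # Sort the keys so the generated command is stable between runs.
    for key, value in sorted(settings.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Illustrative values only; in practice these would come from the external file.
settings = {
    "spark.executor.memory": "4g",
    "spark.sql.adaptive.enabled": "true",
}
cmd = build_spark_submit("my_job.py", settings)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run. If the settings already live in a file in the spark-defaults.conf format, spark-submit can also be pointed at that file directly with its --properties-file option, which avoids generating the --conf flags at all.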

The Spark configuration docs leave a bit to be desired.

As described in the "Dynamically Loading Spark Properties" section of those docs:

> bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:
>
> spark.master spark://5.6.7.8:7077
> spark.executor.memory 4g
> spark.eventLog.enabled true
> spark.serializer org.apache.spark.serializer.KryoSerializer
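
To make the quoted format concrete, here is a minimal sketch of reading such a file yourself. It assumes the simple case described in the quote (one key and one value per line, separated by whitespace) plus blank lines and # comments; it is not Spark's own loader, and the function name parse_spark_defaults and the sample values are invented for the example.

```python
from typing import Dict


def parse_spark_defaults(text: str) -> Dict[str, str]:
    """Parse 'key <whitespace> value' lines into a dict.

    Simplified sketch: skips blank lines and lines starting with '#'.
    """
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace only, so values may contain spaces.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


# Illustrative content; a real job would read this text from the external file.
SAMPLE = """\
# executor sizing
spark.executor.memory 4g
spark.executor.cores  2

spark.sql.adaptive.enabled true
"""

settings = parse_spark_defaults(SAMPLE)
for key, value in settings.items():
    print(key, "=", value)
```

A job that builds its own session could apply the parsed pairs before the session is created, for example by passing list(settings.items()) to PySpark's SparkConf.setAll.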

The official documentation doesn't explicitly state where this file lives, except in passing in a paragraph about Hadoop configuration.

An IBM document states the location more explicitly.

See also: https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark/75751442#75751442

Posted on March 7, 2023 at 22:39:32