Question

I need to externalize the Spark configs in our job.conf files so that they are read from a single external location at runtime and modified only in that one location.

Configs such as
spark.executor.memory
spark.executor.cores
spark.executor.instances
spark.sql.adaptive.enabled
spark.sql.legacy.timeParserPolicy

would be stored in this file.

I am very new to this and am finding very limited resources on the web about handling this process. I've seen a couple of YouTube videos about using a Scala file to handle this. Any assistance would be greatly appreciated.

I have attempted to emulate the Scala examples I have seen online, but I don't know how to call the resulting file from Spark (or even whether the Scala is correct to begin with).

Answer 1

Score: 1

TL;DR:

You can put your configs in $SPARK_HOME/conf/spark-defaults.conf.
Or, if you submit your jobs explicitly with spark-submit or something similar, you can also pass them on the command line using --conf.
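
To make the command-line route concrete, here is a minimal sketch that assembles a spark-submit invocation from Python. The helper name `spark_submit_command`, the application name `my_job.py`, and the file path are all made up for illustration. `--conf` is the flag mentioned above; `--properties-file` is another spark-submit option that points it at a properties file other than the default one, which may be the closest fit for keeping the settings in one external place.

```python
import shlex


def spark_submit_command(app, properties_file=None, overrides=None):
    """Build a spark-submit argument list (hypothetical helper, not part of Spark)."""
    cmd = ["spark-submit"]
    if properties_file:
        # Load properties from an external file instead of the default lookup.
        cmd += ["--properties-file", properties_file]
    for key, value in (overrides or {}).items():
        # Each individual setting is passed as --conf key=value.
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


cmd = spark_submit_command(
    "my_job.py",  # hypothetical application file
    properties_file="/etc/spark/job-overrides.conf",  # hypothetical path
    overrides={"spark.executor.memory": "4g"},
)
# Print the command in a shell-safe form; run it with subprocess if desired.
print(shlex.join(cmd))
```

Building the argument list rather than a single string keeps the quoting correct if a value ever contains spaces.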

The Spark configuration docs leave a bit to be desired.

As described in the Dynamically Loading Spark Properties section:

> bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:
>
> > spark.master spark://5.6.7.8:7077
> > spark.executor.memory 4g
> > spark.eventLog.enabled true
> > spark.serializer org.apache.spark.serializer.KryoSerializer
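
If you would rather read such a file from inside the application, the quoted format (a key and a value separated by whitespace) is easy to parse yourself. The following is only a sketch under assumptions: `parse_spark_conf` and `build_session` are hypothetical helper names, the parser handles only the whitespace-separated form shown in the quote plus `#` comment lines, and `build_session` needs pyspark installed. Settings applied this way take effect only if they are set before the session is created, and some (driver memory, for example) generally have to be supplied at submit time instead.

```python
from pathlib import Path


def parse_spark_conf(text: str) -> dict[str, str]:
    """Parse 'key <whitespace> value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace only
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        conf[key] = value
    return conf


def build_session(conf_path: str, app_name: str = "externalized-conf-demo"):
    """Create a SparkSession configured from an external file (requires pyspark)."""
    # Imported lazily so the parser above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in parse_spark_conf(Path(conf_path).read_text()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical file contents, using keys from the question.
sample = """
# external Spark settings
spark.executor.memory   4g
spark.sql.adaptive.enabled  true
"""
settings = parse_spark_conf(sample)
```

Calling `build_session("/path/to/your/file.conf")` (path is a placeholder) would then hand each pair to `SparkSession.builder.config(key, value)`.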

The official documentation doesn't explicitly mention the file's location, except in passing in a paragraph about Hadoop config.

An IBM doc states it more explicitly.

Also FYI: https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark/75751442#75751442
