Spark session value not updating

Question

I am setting the Spark session values with the code below:

    from pyspark.sql import SparkSession

    def get_spark():  # function name assumed; the original snippet showed only the body
        spark = (SparkSession
                 .builder
                 .appName('LoadDev1')
                 #.config("spark.master","local[2]")
                 .config("spark.master","yarn")
                 .config("spark.yarn.queue","uldp")
                 .config("spark.tez.queue","uldp")
                 .config("spark.executor.instances","5")
                 .enableHiveSupport()
                 .getOrCreate()
                 )
        return spark

and the job is submitted with:

    spark-submit --jars /app/spark3.3.1/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
        --conf spark.sql.shuffle.partitions=100 \
        --conf spark.hive.vectorized.execution.enabled=false \
        --py-files /home/path/SparkFactory_iceberg1.py

But when I print the values inside my program, for example, spark.executor.instances comes back as 10, even though I can see the appName change when I edit the config file, which makes me believe the config file is indeed read but somehow the values get overwritten.
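
For reference, this is roughly how such a check can look at runtime (a minimal sketch; it assumes spark is the session returned by the builder above, and uses PySpark's spark.conf.get accessor):

    # Print the effective configuration values the driver actually sees;
    # spark.conf.get(key, default) falls back to default if the key is unset.
    for key in ("spark.master",
                "spark.yarn.queue",
                "spark.executor.instances",
                "spark.sql.shuffle.partitions"):
        print(key, "=", spark.conf.get(key, "<unset>"))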

If I provide a value using --conf it is reflected, but I want to use the config file rather than --conf.

Please help me with this.

Answer 1

Score: 1

spark-submit is a completely separate application from the session you create at the top, so you need to pass those configs to the spark-submit command. You can put them in a properties file, which overrides the default Spark config at conf/spark-defaults.conf, with entries like this:

app.conf

    spark.master yarn
    spark.yarn.queue uldp
    spark.tez.queue uldp
    spark.executor.instances 5
    spark.sql.shuffle.partitions 100
    spark.hive.vectorized.execution.enabled false

and pass it to spark-submit with --properties-file:

    $ spark-submit \
        --properties-file <PATH>/app.conf \
        --jars /app/spark3.3.1/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
        --py-files /home/path/SparkFactory_iceberg1.py \
        /home/path/main.py
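
Once the settings live in app.conf, the hard-coded .config(...) calls in the application are redundant. As a minimal sketch (assuming every cluster setting now comes from the properties file), the builder could shrink to:

    from pyspark.sql import SparkSession

    # master, queues and executor count are now supplied via
    # --properties-file at submit time, so the code no longer sets them.
    spark = (SparkSession
             .builder
             .appName('LoadDev1')
             .enableHiveSupport()
             .getOrCreate())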
