Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx
Question
I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");
This is how I create the CrossValidator:
ParamMap[] paramGrid = new ParamGridBuilder()
.addGrid(gbt.maxBins(), new int[]{50})
.addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
.addGrid(gbt.maxIter(), new int[]{5, 20, 40})
.addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
.build();
CrossValidator gbcv = new CrossValidator()
.setEstimator(gbt)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(gbevaluator)
.setNumFolds(5)
.setParallelism(8)
.setSeed(session.getArguments().getTrainingRandom());
The problem is that when maxDepth (in paramGrid) is just {2, 5} and maxIter is {5, 20}, everything works fine, but with the grid as in the code above it keeps logging:
WARN DAGScheduler: broadcasting large task binary with size xx
with xx going from 1000 KiB to 2.9 MiB, often leading to a timeout exception.
Which Spark parameters should I change to avoid this?
Answer 1
Score: 4
For the timeout issue, consider changing the following configuration:
set spark.sql.autoBroadcastJoinThreshold to -1.
This removes the limit on broadcast size, which defaults to 10 MB.
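A minimal sketch of how this could be applied, assuming the same SparkConf object (conf) from the question; spark.sql.autoBroadcastJoinThreshold is a standard Spark SQL setting and -1 disables automatic broadcast joins:
// Disable the automatic broadcast-join threshold (default is about 10 MB);
// -1 tells Spark SQL never to broadcast one side of a join automatically.
conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
// Alternatively, on an already-running SparkSession (variable name assumed):
// spark.conf().set("spark.sql.autoBroadcastJoinThreshold", "-1");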
Answer 2
Score: 0
The solution that worked for me was reducing the task size, i.e. reducing the amount of data each task handles.
First, check the number of partitions in the DataFrame via df.rdd.getNumPartitions().
Then increase the number of partitions, e.g. df.repartition(100).
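A minimal sketch of the same idea in the question's Java API, assuming df is a Dataset&lt;Row&gt; already loaded elsewhere; the partition count of 100 is just an example value:
// Check how many partitions the Dataset currently has.
int numPartitions = df.rdd().getNumPartitions();
System.out.println("Current partitions: " + numPartitions);
// Repartition into more, smaller partitions so each task (and the
// serialized task binary that goes with it) handles less data.
Dataset<Row> repartitioned = df.repartition(100);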
Answer 3
Score: 0
I got a similar WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 5.2 MiB.
What worked for me was increasing the machine configuration from 2 vCPUs / 7.5 GB RAM to 4 vCPUs / 15 GB RAM (some Parquet files were created but the job never completed), and then to 8 vCPUs / 32 GB RAM, after which everything worked. This was on GCP Dataproc.