Spark v3.0.0 – WARN DAGScheduler: broadcasting large task binary with size xx


Question

I'm new to Spark. I'm writing a machine learning algorithm in Spark standalone (v3.0.0) with the following configuration:

SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");

This is how I create the CrossValidator:

ParamMap[] paramGrid = new ParamGridBuilder()
            .addGrid(gbt.maxBins(), new int[]{50})
            .addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
            .addGrid(gbt.maxIter(), new int[]{5, 20, 40})
            .addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
            .build();

CrossValidator gbcv = new CrossValidator()
            .setEstimator(gbt)
            .setEstimatorParamMaps(paramGrid)
            .setEvaluator(gbevaluator)
            .setNumFolds(5)
            .setParallelism(8)
            .setSeed(session.getArguments().getTrainingRandom());

The problem is that when maxDepth (in paramGrid) is just {2, 5} and maxIter is just {5, 20}, everything works fine, but with the full grid above it keeps logging:
WARN DAGScheduler: broadcasting large task binary with size xx
where xx grows from 1000 KiB to 2.9 MiB, often leading to a timeout exception. Which Spark parameters should I change to avoid this?
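
For context, a rough sketch of what running the cross-validation above involves (gbcv.fit() is the standard CrossValidator API; trainingData is a hypothetical Dataset&lt;Row&gt; that the question does not show):

import org.apache.spark.ml.Model;
import org.apache.spark.ml.tuning.CrossValidatorModel;

// Hypothetical fit call; `trainingData` is not part of the original question.
// Grid size: 1 (maxBins) x 3 (maxDepth) x 3 (maxIter) x 3 (minInfoGain) = 27
// parameter combinations, and with numFolds = 5 that means 135 GBT models are
// trained, the most expensive combination being maxDepth = 10 with maxIter = 40.
CrossValidatorModel model = gbcv.fit(trainingData);
Model<?> best = model.bestModel();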


Answer 1

Score: 4

For the timeout issue, consider changing the following configuration:

spark.sql.autoBroadcastJoinThreshold to -1.

Setting it to -1 disables the size threshold for automatic broadcast joins, which defaults to 10 MB.
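
A minimal sketch of applying that setting, assuming the SparkConf from the question (and, for the runtime variant, an existing SparkSession called spark, which the question does not show):

// Added alongside the existing conf.set(...) calls from the question:
// -1 disables the size threshold for automatic broadcast joins (default 10 MB).
conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");

// The same key can also be changed at runtime on an existing SparkSession
// (`spark` is assumed here, not taken from the question):
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", "-1");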


Answer 2

Score: 0

The solution that worked for me was:

reducing the task size => reducing the amount of data each task handles.

First, check the number of partitions in the DataFrame via df.rdd.getNumPartitions().
Then increase the number of partitions: df.repartition(100).
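
A minimal sketch of that check-then-repartition step in Java, matching the question's code (df stands for the input DataFrame; 100 is the partition count suggested in this answer, not a universal value):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// `df` is the DataFrame being fed to the estimator (assumed here).
int before = df.rdd().getNumPartitions();
System.out.println("Partitions before: " + before);

// More partitions => each task handles a smaller slice of the data.
Dataset<Row> repartitioned = df.repartition(100);
System.out.println("Partitions after: " + repartitioned.rdd().getNumPartitions());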


Answer 3

Score: 0

I ran into a similar WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 5.2 MiB. What worked for me was increasing the machine configuration from 2 vCPUs / 7.5 GB RAM to 4 vCPUs / 15 GB RAM; some Parquet files were created but the job still never completed, so I increased it again to 8 vCPUs / 32 GB RAM, and now everything works. This was on GCP Dataproc.

