Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx
Question
I'm new to Spark. I'm coding a machine learning algorithm in Spark standalone (v3.0.0) with these configurations set:
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");
This is how I create the CrossValidator:
ParamMap[] paramGrid = new ParamGridBuilder()
.addGrid(gbt.maxBins(), new int[]{50})
.addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
.addGrid(gbt.maxIter(), new int[]{5, 20, 40})
.addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
.build();
CrossValidator gbcv = new CrossValidator()
.setEstimator(gbt)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(gbevaluator)
.setNumFolds(5)
.setParallelism(8)
.setSeed(session.getArguments().getTrainingRandom());
The problem is that when maxDepth (in paramGrid) is just {2, 5} and maxIter is {5, 20}, everything works fine, but with the grid as in the code above it keeps logging:
WARN DAGScheduler: broadcasting large task binary with size xx
with xx going from 1000 KiB to 2.9 MiB, often leading to a timeout exception.
Which Spark parameters should I change to avoid this?
Answer 1
Score: 4
For the timeout issue, consider changing the following configuration:
set spark.sql.autoBroadcastJoinThreshold to -1.
This removes the limit on broadcast size, which defaults to 10 MB.
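A minimal sketch of how this could be applied, assuming the same SparkConf object (conf) from the question; spark.sql.autoBroadcastJoinThreshold is a standard Spark SQL setting and -1 disables automatic broadcast joins:
// Disable the automatic broadcast-join threshold (default is about 10 MB);
// -1 tells Spark SQL never to broadcast one side of a join automatically.
conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
// Alternatively, on an already-running SparkSession (variable name assumed):
// spark.conf().set("spark.sql.autoBroadcastJoinThreshold", "-1");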
Answer 2
Score: 0
The solution that worked for me was reducing the task size, i.e. reducing the amount of data each task handles.
First, check the number of partitions in the DataFrame via df.rdd.getNumPartitions().
Then increase the number of partitions, e.g. df.repartition(100).
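A minimal sketch of the same idea in the question's Java API, assuming df is a Dataset&lt;Row&gt; already loaded elsewhere; the partition count of 100 is just an example value:
// Check how many partitions the Dataset currently has.
int numPartitions = df.rdd().getNumPartitions();
System.out.println("Current partitions: " + numPartitions);
// Repartition into more, smaller partitions so each task (and the
// serialized task binary that goes with it) handles less data.
Dataset<Row> repartitioned = df.repartition(100);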
Answer 3
Score: 0
I got a similar WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 5.2 MiB.
What worked for me was increasing the machine configuration from 2 vCPUs / 7.5 GB RAM to 4 vCPUs / 15 GB RAM (some Parquet files were created but the job never completed), and then to 8 vCPUs / 32 GB RAM, after which everything worked. This was on GCP Dataproc.