Spark partition count is always 1?

Question

I'm running Spark in local mode and I expect multiple partitions to be present when doing a simple groupBy, but the partition count always comes out as 1:

val spark = SparkSession.builder()
  .appName("Test App")
  .master("local[5]")
  .getOrCreate()
spark.sparkContext.setLogLevel("OFF")

import spark.implicits._

val simpleData = Seq(("James", "Sales", "NY", 90000, 34, 10000),
  ("Michael", "Sales", "NY", 86000, 56, 20000),
  ("Robert", "Sales", "CA", 81000, 30, 23000),
  ("Maria", "Finance", "CA", 90000, 24, 23000),
  ("Raman", "Finance", "CA", 99000, 40, 24000),
  ("Scott", "Finance", "NY", 83000, 36, 19000),
  ("Jen", "Finance", "NY", 79000, 53, 15000),
  ("Jeff", "Marketing", "CA", 80000, 25, 18000),
  ("Kumar", "Marketing", "NY", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")

val df2 = df.groupBy("state").count()

println(df2.rdd.getNumPartitions) // always prints 1

The above prints 1. Since the default shuffle partition count is 200, I'm expecting more than one partition. I even set the shuffle partition count manually, but I still get 1:

spark.conf.set("spark.sql.shuffle.partitions", 500)

Is some setting not correct, or is my understanding of shuffle partitions wrong?

Answer 1

Score: 1

This is due to Adaptive Query Execution (AQE), which optimizes the plan at runtime by looking at the actual data volume (among other things).

Turning it off, as shown below, will give you the expected behaviour:

...
import spark.implicits._
spark.conf.set("spark.sql.adaptive.enabled", false) // <-- this
val simpleData = Seq(("James", "Sales", "NY", 90000, 34, 10000),
...

You will then get 200, or any other value you explicitly set. For example:

200
import spark.implicits._
simpleData: Seq[(String, String, String, Int, Int, Int)] = List((James,Sales,NY,90000,34,10000), (Michael,Sales,NY,86000,56,20000), (Robert,Sales,CA,81000,30,23000), (Maria,Finance,CA,90000,24,23000), (Raman,Finance,CA,99000,40,24000), (Scott,Finance,NY,83000,36,19000), (Jen,Finance,NY,79000,53,15000), (Jeff,Marketing,CA,80000,25,18000), (Kumar,Marketing,NY,91000,50,21000))
df: org.apache.spark.sql.DataFrame = [employee_name: string, department: string ... 4 more fields]
df2: org.apache.spark.sql.DataFrame = [state: string, count: bigint]
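
For reference, below is a minimal, self-contained sketch of the same fix applied end to end. The object name, app name, and the smaller sample data are illustrative placeholders, and it assumes Spark 3.x (AQE is enabled by default from 3.2 onwards):

import org.apache.spark.sql.SparkSession

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    // App name, master and object name are illustrative placeholders.
    val spark = SparkSession.builder()
      .appName("Partition Count Demo")
      .master("local[5]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("OFF")

    // Turn off Adaptive Query Execution so the groupBy shuffle keeps
    // spark.sql.shuffle.partitions partitions instead of being coalesced down.
    spark.conf.set("spark.sql.adaptive.enabled", false)
    spark.conf.set("spark.sql.shuffle.partitions", 500)

    import spark.implicits._

    val df = Seq(
      ("James", "Sales", "NY", 90000, 34, 10000),
      ("Maria", "Finance", "CA", 90000, 24, 23000),
      ("Jeff", "Marketing", "CA", 80000, 25, 18000)
    ).toDF("employee_name", "department", "state", "salary", "age", "bonus")

    val df2 = df.groupBy("state").count()

    // With AQE disabled this should print the configured value (500 here,
    // or 200 if spark.sql.shuffle.partitions is left at its default).
    println(df2.rdd.getNumPartitions)

    spark.stop()
  }
}

With spark.sql.adaptive.enabled left at its default, the same tiny dataset would typically be coalesced back down to a single partition, which is the behaviour described in the question.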

Here is a link with more detail: https://sparkbyexamples.com/spark/spark-adaptive-query-execution/
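
As a side note not covered by the answer above: the AQE feature that collapses this small shuffle into one partition is its shuffle-partition coalescing. A hedged alternative, assuming Spark 3.x where spark.sql.adaptive.coalescePartitions.enabled is available (it defaults to true), is to keep AQE on but disable only that step, which should also leave the shuffle at the configured partition count:

// Alternative sketch: keep AQE on, but stop it from merging small shuffle
// partitions, so the groupBy keeps spark.sql.shuffle.partitions partitions.
spark.conf.set("spark.sql.adaptive.enabled", true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)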
