Different behaviour of same query in Spark 2.3 vs Spark 3.2

Question

I am running a simple query in two versions of Spark, 2.3 and 3.2. The code is as below:

spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID") // note: "ID" repeats "id", differing only in case
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

In Spark 2.3 it returns:

+----+
| id |
+----+
| 1  |
| 1  |
+----+

But in Spark 3.2 it fails with:

org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)

I was expecting both versions to have the same result, or at least a configuration to make the behavior consistent. The following settings do not change the behavior:

spark.sql.analyzer.failAmbiguousSelfJoin=false
spark.sql.caseSensitive=false
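
For reference, a sketch of how these settings can be applied, either at launch or at runtime; the exact combination below is an assumption, not part of the original run:

spark-shell --master yarn --deploy-mode client \
  --conf spark.sql.analyzer.failAmbiguousSelfJoin=false \
  --conf spark.sql.caseSensitive=false

or, inside a running session, before building the DataFrames:

spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
spark.conf.set("spark.sql.caseSensitive", "false")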

On top of this, when both column references use the same case, it works:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id") // same column name repeated, same case this time
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

Further analysis points out that this behavior was introduced in 2.4; the same query already fails in Spark 2.4.

Answer 1

Score: 0

By default, Spark is not case sensitive. In Spark 3.x, with the following option activated, it works the same way as in Spark 2.3:

spark.conf.set("spark.sql.caseSensitive", "true")

I tried to dig a little deeper into the difference in behavior between 2.3 and 3.2 and found a simpler example that reproduces the problem. In Spark 2.3, without case sensitivity (the default), this does not fail:

spark.range(1).select("id", "ID").select("id").explain
== Physical Plan ==
*(1) Range (0, 1, step=1, splits=4)

We can see that Spark simplifies the select so that it does not have to deal with the ambiguity.

In 3.x, however, it fails. I tried setting spark.sql.analyzer.failAmbiguousSelfJoin to false, since it defaults to true as of 3.0 (https://spark.apache.org/docs/latest/sql-migration-guide.html), but that does not change the result.

Answer 2

Score: 0

The error was introduced in Spark 2.4, when code was added under expression handling. In Spark 2.3 a distinct was applied to the candidates, but the later code only worked on candidates/prunedCandidates without applying distinct. Once the distinct is added back while resolving attributes for the plan, the behavior is the same as in 2.3.
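
To illustrate the idea, here is a minimal Scala sketch; it is not the actual Spark resolution code, and the names AttrRef and resolve are made up. The point is that when case-insensitive resolution puts the same underlying attribute into the output twice, deduplicating the matching candidates removes the false ambiguity:

// Minimal sketch only, not Spark source. Models attribute resolution where the
// same attribute (same exprId) appears twice in the output because it was
// selected as both "id" and "ID" under case-insensitive resolution.
case class AttrRef(name: String, exprId: Long)

def resolve(name: String, output: Seq[AttrRef]): AttrRef = {
  val candidates = output.filter(_.name.equalsIgnoreCase(name))
  candidates.distinct match {               // the added .distinct is the essence of the fix
    case Seq(single) => single
    case Seq()       => sys.error(s"cannot resolve '$name'")
    case many        => sys.error(s"Reference '$name' is ambiguous, could be: " +
                                  many.map(_.name).mkString(", "))
  }
}

// The duplicated projection yields the same attribute twice:
val output = Seq(AttrRef("id", 1L), AttrRef("col2", 2L), AttrRef("id", 1L))
resolve("id", output)   // resolves to AttrRef("id", 1) instead of failing as ambiguous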

The PR for this fix has been merged into the Spark 3.4 branch. See: https://github.com/apache/spark/pull/40258
