Different behaviour of same query in Spark 2.3 vs Spark 3.2
Question
I am running a simple query in two versions of Spark, 2.3 and 3.2. The code is below:
spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
In Spark 2.3 it returns
+----+
| id |
+----+
| 1 |
| 1 |
+----+
But in Spark 3.2 it fails with
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)
I was expecting both versions to have the same result, or at least a configuration to make the behavior consistent. The following settings do not change the behavior:
spark.sql.analyzer.failAmbiguousSelfJoin=false
spark.sql.caseSensitive=False
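For reference, a minimal way to apply these settings inside the spark-shell session (they can also be passed as --conf flags when starting spark-shell); as noted above, neither changes the behavior:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
spark.conf.set("spark.sql.caseSensitive", "false")
df2.select("id").show()   // still fails on 3.2 with the ambiguity error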
On top of this, when both column references use the same case, it works:
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
Further analysis indicates that this behavior was introduced in 2.4: the same query fails even in Spark 2.4.
Answer 1
Score: 0
By default, Spark is not case sensitive. In Spark 3.x, with the following option activated, it works the same way as in Spark 2.3:
spark.conf.set("spark.sql.caseSensitive", "true")
I tried to dig a little deeper into the difference in behavior between 2.3 and 3.2 and found a simpler example that reproduces the problem. In Spark 2.3, without case sensitivity (the default), this does not fail:
spark.range(1).select("id", "ID").select("id").explain
== Physical Plan ==
*(1) Range (0, 1, step=1, splits=4)
We can see that Spark simplifies the select so that it does not have to deal with the ambiguity.
In 3.x, however, it fails. I tried setting spark.sql.analyzer.failAmbiguousSelfJoin to false, since it defaults to true as of 3.0 (https://spark.apache.org/docs/latest/sql-migration-guide.html), but that does not change the result.
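For reference, a sketch of how that attempt looks in a spark-shell session (assuming the config is set before the query is analyzed); as noted above, the ambiguity error is still raised:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
spark.range(1).select("id", "ID").select("id").show()   // still fails: Reference 'id' is ambiguous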
Answer 2
Score: 0
The error was introduced in Spark 2.4 when code was added under expression resolution. In Spark 2.3 the resolver applied distinct to the candidates, but the later code only worked with candidates/prunedCandidates and did not apply distinct. Once distinct is added back while resolving attributes for the plan, the behavior is the same as in 2.3.
The PR for this fix is merged into the Spark 3.4 branch. See: https://github.com/apache/spark/pull/40258
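Until running a version that includes that fix, one possible workaround (a sketch of my own, not part of the linked PR; dedupedCols is just an illustrative name) is to deduplicate the requested column names case-insensitively before the select, mirroring the missing distinct described above:
// Drop case-insensitive duplicates from op_cols so only one reference to "id" remains.
val dedupedCols = op_cols.foldLeft(List.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}
val df2 = df1.select(dedupedCols.head, dedupedCols.tail: _*)
df2.select("id").show()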