Different behaviour of same query in Spark 2.3 vs Spark 3.2
Question
I am running a simple query in two versions of Spark, 2.3 and 3.2. The code is below:
spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
In Spark 2.3 it returns
+----+
| id |
+----+
| 1 |
| 1 |
+----+
But in Spark 3.2 it fails with
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)
I was expecting both versions to have the same result, or at least a configuration to make the behavior consistent. The following settings do not change the behavior:
spark.sql.analyzer.failAmbiguousSelfJoin=false
spark.sql.caseSensitive=False
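For reference, a minimal way to apply these settings inside the spark-shell session (they can also be passed as --conf flags when starting spark-shell); as noted above, neither changes the behavior:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
spark.conf.set("spark.sql.caseSensitive", "false")
df2.select("id").show()   // still fails on 3.2 with the ambiguity error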
On top of this, when both column references use the same case, it works:
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
Further analysis indicates that this behavior was introduced in 2.4: the same query fails even in Spark 2.4.
Answer 1
Score: 0
By default, Spark is not case sensitive. In Spark 3.x, with the following option activated, it works the same way as in Spark 2.3:
spark.conf.set("spark.sql.caseSensitive", "true")
I tried to dig a little deeper into the difference in behavior between 2.3 and 3.2 and found a simpler example that reproduces the problem. In Spark 2.3, without case sensitivity (the default), this does not fail:
spark.range(1).select("id", "ID").select("id").explain
== Physical Plan ==
*(1) Range (0, 1, step=1, splits=4)
We can see that Spark simplifies the select so that it does not have to deal with the ambiguity.
In 3.x, however, it fails. I tried setting spark.sql.analyzer.failAmbiguousSelfJoin to false, since it defaults to true as of 3.0 (https://spark.apache.org/docs/latest/sql-migration-guide.html), but that does not change the result.
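For reference, a sketch of how that attempt looks in a spark-shell session (assuming the config is set before the query is analyzed); as noted above, the ambiguity error is still raised:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
spark.range(1).select("id", "ID").select("id").show()   // still fails: Reference 'id' is ambiguous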
Answer 2
Score: 0
The error was introduced in Spark 2.4 when code was added under expression resolution. In Spark 2.3 the resolver applied distinct to the candidates, but the later code only worked with candidates/prunedCandidates and did not apply distinct. Once distinct is added back while resolving attributes for the plan, the behavior is the same as in 2.3.
The PR for this fix is merged into the Spark 3.4 branch. See: https://github.com/apache/spark/pull/40258
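Until running a version that includes that fix, one possible workaround (a sketch of my own, not part of the linked PR; dedupedCols is just an illustrative name) is to deduplicate the requested column names case-insensitively before the select, mirroring the missing distinct described above:
// Drop case-insensitive duplicates from op_cols so only one reference to "id" remains.
val dedupedCols = op_cols.foldLeft(List.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}
val df2 = df1.select(dedupedCols.head, dedupedCols.tail: _*)
df2.select("id").show()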