2023年6月29日 21:36:57go评论141阅读模式

英文:

Syntax error when using Nessie commands with DBT but not using Spark

问题

We are trying to setup an environment using AWS EMR (on EC2), DBT, Spark and Nessie.

Even though all the extensions are installed correctly, and Nessie commands like 'CREATE BRANCH' work on the cluster from Jupyter, and directly from Spark, they do not work as part of DBT.

Regular SQL commands work, and return responses as intended but when trying to create a branch or use one I get a parsing error.

I am using the latest versions possible

here is the stack trace:

ERROR SparkExecuteStatementOperation: Error executing query with a1aab108-eaaa-4c48-9951-5959ca24a038, currentState RUNNING,
org.apache.spark.sql.catalyst.parser.ParseException:
Syntax error at or near 'demo_branch2'(line 3, pos 22)
== SQL ==
/* {"app": "dbt", "dbt_version": "1.5.2", "profile_name": "thrift_tests", "target_name": "dev", "node_id": "model.thrift_tests.my_first_dbt_model"} */
use reference demo_branch2 in dev_catalog
----------------------^^^

EMR (6.10) configuration:

[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.63.0",
      "spark.sql.catalog.dev_catalog": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.dev_catalog.auth_type": "NONE",
      "spark.sql.catalog.dev_catalog.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
      "spark.sql.catalog.dev_catalog.ref": "main",
      "spark.sql.catalog.dev_catalog.uri": "https://nessie-dev.dev.XYZ.cloud/api/v1",
      "spark.sql.catalog.dev_catalog.warehouse": "s3://.../nessie_catalog/",
      "spark.sql.defaultCatalog": "dev_catalog",
      "spark.sql.execution.pyarrow.enabled": "true",
      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
    }
  }
]

I have tried running commands that are not related to Nessie and they were successful

I tried running spark commands for Nessie and they worked.

Everything works as intended from Jupyter

I tried changing the configuration of the thrift server to not validate (using GPT, I was not able to find that myself it might be GPT just making shit up)

I tried running Iceberg commands using DBT and it works.

英文:

We are trying to setup an environment using AWS EMR (on EC2), DBT, Spark and Nessie.

Even though all the extentions are installed correctly, and Nessie commands like 'CREATE BRANCH' work on the cluster from Jupyter, and directly from Spark, they do not work as part of DBT.

Regular SQL commands work, and return responses as intended but when trying to create a branch or use one I get a parsing error.

I am using the latest versions possible

here is the stack trace:

ERROR SparkExecuteStatementOperation: Error executing query with a1aab108-eaaa-4c48-9951-5959ca24a038, currentState RUNNING,
org.apache.spark.sql.catalyst.parser.ParseException:
Syntax error at or near &#39;demo_branch2&#39;(line 3, pos 22)
== SQL ==
/* {&quot;app&quot;: &quot;dbt&quot;, &quot;dbt_version&quot;: &quot;1.5.2&quot;, &quot;profile_name&quot;: &quot;thrift_tests&quot;, &quot;target_name&quot;: &quot;dev&quot;, &quot;node_id&quot;: &quot;model.thrift_tests.my_first_dbt_model&quot;} */
        use reference demo_branch2 in dev_catalog
----------------------^^^
        at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.catalyst.parser.extensions.IcebergSparkSqlExtensionsParser.parsePlan(IcebergSparkSqlExtensionsParser.scala:133) ~[iceberg-spark-runtime-3.3_2.12-1.1.0-amzn-0.jar:?]
        at org.apache.spark.sql.catalyst.parser.extensions.NessieSparkSqlExtensionsParser.parsePlan(NessieSparkSqlExtensionsParser.scala:114) ~[org.projectnessie.nessie-integrations_nessie-spark-extensions-3.3_2.12-0.51.1.jar:?]
        at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:291) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_372]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_372]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) ~[hadoop-client-api-3.3.3-amzn-2.jar:?]
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239) ~[spark-hive-thriftserver_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_372]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_372]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]

EMR (6.10) configuration:

[
  {
    &quot;Classification&quot;: &quot;iceberg-defaults&quot;,
    &quot;Properties&quot;: {
      &quot;iceberg.enabled&quot;: &quot;true&quot;
    }
  },
  {
    &quot;Classification&quot;: &quot;spark-defaults&quot;,
    &quot;Properties&quot;: {
      &quot;spark.jars.packages&quot;: &quot;org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.63.0&quot;,
      &quot;spark.sql.catalog.dev_catalog&quot;: &quot;org.apache.iceberg.spark.SparkCatalog&quot;,
      &quot;spark.sql.catalog.dev_catalog.auth_type&quot;: &quot;NONE&quot;,
      &quot;spark.sql.catalog.dev_catalog.catalog-impl&quot;: &quot;org.apache.iceberg.nessie.NessieCatalog&quot;,
      &quot;spark.sql.catalog.dev_catalog.ref&quot;: &quot;main&quot;,
      &quot;spark.sql.catalog.dev_catalog.uri&quot;: &quot;https://nessie-dev.dev.XYZ.cloud/api/v1&quot;,
      &quot;spark.sql.catalog.dev_catalog.warehouse&quot;: &quot;s3://.../nessie_catalog/&quot;,
      &quot;spark.sql.defaultCatalog&quot;: &quot;dev_catalog&quot;,
      &quot;spark.sql.execution.pyarrow.enabled&quot;: &quot;true&quot;,
      &quot;spark.sql.extensions&quot;: &quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions&quot;
    }
  }
]

I have tried running commands that are not related to Nessie and they were successful

I tried running spark commands for Nessie and they worked.

Everything works as intended from Jupyter

I tried changing the configuration of the thrift server to not validate (using GPT, I was not able to find that myself it might be GPT just making shit up)

I tried running Iceberg commands using DBT and it works.

答案1

得分: 3

请注意，发送到Spark的SQL附带以下注释前缀：

/* {&quot;app&quot;: &quot;dbt&quot;, &quot;dbt_version&quot;: &quot;1.5.2&quot;, &quot;profile_name&quot;: &quot;thrift_tests&quot;, &quot;target_name&quot;: &quot;dev&quot;, &quot;node_id&quot;: &quot;model.thrift_tests.my_first_dbt_model&quot;} */

尽管这个注释在最终的SQL是标准时不会引起问题，但在与nessie命令（如USE REFERENCE，LIST REFERENCES等）一起使用时会引起问题。

为了解决这个问题，我们不得不重写dbt.adapters.spark.session中的execute函数并删除注释。

我们这样做：

original_execute = Cursor.execute
def execute(self, sql: str, *parameters: Any) -&gt; None:
    try:
        sql = sql[sql.find(&quot;*/&quot;) + 3:]
        original_execute(self, sql, *parameters)
    except AnalysisException as exc:
        raise DbtRuntimeError(str(exc)) from exc
Cursor.execute = execute

在上述覆盖之后，我们从代码中运行DBT，成功执行。

AnalysisException包装与在dbt-spark存储库中列出的另一个错误相关。

英文:

Note that the SQL that is sent to spark is sent with a prefix of the following comment:

/* {&quot;app&quot;: &quot;dbt&quot;, &quot;dbt_version&quot;: &quot;1.5.2&quot;, &quot;profile_name&quot;: &quot;thrift_tests&quot;, &quot;target_name&quot;: &quot;dev&quot;, &quot;node_id&quot;: &quot;model.thrift_tests.my_first_dbt_model&quot;} */

Although this comment doesn't cause problems when the SQL in the end is standard, it does cause problem when using it with nessie commands such as USE REFERENCE, LIST REFERENCES and etc.

To overcome this issue, We had to override the function execute in dbt.adapters.spark.session and remove the comment.

We did it as follows:

original_execute = Cursor.execute
def execute(self, sql: str, *parameters: Any) -&gt; None:
    try:
        sql = sql[sql.find(&quot;*/&quot;) + 3:]
        original_execute(self, sql, *parameters)
    except AnalysisException as exc:
        raise DbtRuntimeError(str(exc)) from exc
Cursor.execute = execute

After the override above we run the DBT from the code and it is successful.
The AnalysisException wrap is related to a different bug that is listed in dbt-spark repo.

答案2

得分: 1

根据 @user502490 建议，问题源于在每个查询开头插入的查询注释。另一种解决方法是在 dot_project.yml 中将 query-comment 字段设置为 null。这将阻止 DBT 将元数据注释添加到所有查询中。

# dbt_project.yml
...
query-comment: null
...

英文:

As @user502490 suggested, the problem stems from the query comment inserted at the beginning of each query. An alternative solution is you can set the query-comment field in the dot_project.yml to null. This will prevent DBT from adding the metadata comment to all queries.

# dbt_project.yml
...
query-comment: null
...

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Nessie命令在使用DBT时出现语法错误，但不在使用Spark时出现。

问题

答案1

答案2

spark-3.0.1-bin-hadoop2.7启动失败，java：没有该文件或目录

在AWS Glue中写入BigQuery时出现空指针异常。

How to Iterate though scala dataframe rows and store the column name in variables which can be used for some opertions inside for loop?

如何在Java Spark中转换嵌套的结构体（不支持的NullType）

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。