Table gets deleted when trying to overwrite the data in it from Databricks Spark

Question

I am trying to write DataFrame data into a table in Azure SQL from Databricks using PySpark.
The table dbo.test already exists in the database, and I am able to read it before I execute the write operation below.

```python
testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", azure_sql_url).option("dbtable", 'dbo.test')\
        .option("databaseName", database_name)\
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()
```

After executing this command, the following error is returned:

> java.lang.NoSuchMethodError:
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

Surprisingly, the dbo.test table gets deleted.

Can someone help me understand why this is happening? The same code works fine in another environment.

Answer 1

Score: 1

> java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

The cause of the error is a ***version mismatch between Spark SQL and the JDBC Spark connector***. Also, `mode("overwrite")` by default drops the table if it already exists and re-creates a new one, so when the write fails after the drop, dbo.test is left deleted.
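
As a side note: if you want overwrite semantics without the table ever being dropped, Spark's JDBC write path supports `truncate=true`, which truncates the existing table instead of dropping and re-creating it. A minimal sketch, reusing the variables from the question and assuming the connector honors Spark's standard JDBC `truncate` option (this is not the fix for the NoSuchMethodError itself):

```python
# With mode("overwrite") and truncate=true, the existing dbo.test is
# truncated rather than dropped, so its schema survives even if the
# subsequent bulk insert fails partway through.
testDf.write.format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", azure_sql_url) \
    .option("dbtable", "dbo.test") \
    .option("user", username) \
    .option("password", password) \
    .save()
```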

**Here are some supported versions of the JDBC Spark connector with their respective Spark versions** (a quick version check follows the list):

- **Spark 2.4.x** : com.microsoft.azure:spark-mssql-connector:1.0.2
- **Spark 3.0.x** : com.microsoft.azure:spark-mssql-connector_2.12:1.1.0
- **Spark 3.1.x** : com.microsoft.azure:spark-mssql-connector_2.12:1.2.0
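
To see which artifact applies to your cluster, you can print the Spark version from a notebook first; a minimal check (`spark` is the session Databricks pre-creates in notebooks):

```python
# Print the cluster's Spark version so you can pick the matching
# spark-mssql-connector artifact from the list above
# (e.g. "3.0.1" means the Spark 3.0.x connector).
print(spark.version)
```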

**Resolution:**

If you are using a Databricks runtime version greater than 10, you need to downgrade it to **Databricks runtime version 9.1 LTS or lower** and use the respective connector.
- I downgraded the Databricks runtime version to 7.3 LTS:
![enter image description here](https://i.imgur.com/oFYCaAo.png)

- Then installed the appropriate library for Spark 3.0.x, which is `com.microsoft.azure:spark-mssql-connector_2.12:1.1.0`:
![enter image description here](https://i.imgur.com/cGfwV4Q.png)

- Then tried your code, and it works fine:
```python
df_name.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
        .option("databaseName", "databasename")\
        .option("user", "username") \
        .option("password", "password") \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()
```
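
To confirm the table survived the overwrite, here is a minimal read-back check, reusing the placeholder connection details from the snippet above:

```python
# Read dbo.test back through the same connector and show a few rows;
# if the table had been dropped, this load would fail instead.
check_df = (spark.read.format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", "Azure_sql_url")
    .option("dbtable", "dbo.test")
    .option("user", "username")
    .option("password", "password")
    .load())
check_df.show(5)
```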

My execution:

*(screenshot from the original post)*

OUTPUT:

*(screenshot from the original post)*

Or you can directly use the generic JDBC source with `dataframe_name.write.format("jdbc")`.

Sample code:

```python
df_name.write.format("jdbc").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
        .option("databaseName", "databasename")\
        .option("user", "username") \
        .option("password", "password") \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()
```

You can also refer to a similar issue on GitHub or a similar SO thread.

Reference: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16

Tags: apache-spark-sql, azure-sql-database, pyspark, spark-jdbc, sql-server