Table gets deleted when trying to overwrite the data in it from Databricks Spark
Question

I am trying to write DataFrame data into a table in Azure SQL from Databricks using PySpark.

The table dbo.test already exists in the database, and I am able to read it before I execute the write operation below.
testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
.option("truncate", "false")\
.option("url", azure_sql_url).option("dbtable", 'dbo.test')\
.option("databaseName", database_name)\
.option("user", username) \
.option("password", password) \
.option("encrypt", "true")\
.option("hostNameInCertificate", "*.database.windows.net")\
.option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
.option("bulkCopyTimeout", "6000000")\
.save()
After executing this command, the following error is returned:

> java.lang.NoSuchMethodError:
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

Surprisingly, the dbo.test table gets deleted.

Can someone help me understand why this is happening? The same code works fine in another environment.
Answer 1

Score: 1

> java.lang.NoSuchMethodError:
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

The cause of the error is a version mismatch between Spark SQL and the JDBC Spark connector. Also, `mode("overwrite")` by default drops the table if it already exists and re-creates a new one, which is why `dbo.test` is gone: the drop succeeded before the re-create attempt failed with the error above.
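If keeping the existing table and its schema matters, one option is Spark's generic JDBC writer with `truncate` set to `"true"`: in overwrite mode it then truncates the table instead of dropping and re-creating it. A minimal sketch, reusing the placeholder variables from the question:

```python
# Minimal sketch using Spark's generic JDBC writer (not the MS connector).
# With mode("overwrite") and truncate=true, Spark issues TRUNCATE TABLE instead of
# DROP TABLE + CREATE TABLE, so dbo.test and its schema survive a failed write.
# azure_sql_url, username and password are the placeholders from the question.
testDf.write.format("jdbc") \
    .mode("overwrite") \
    .option("truncate", "true") \
    .option("url", azure_sql_url) \
    .option("dbtable", "dbo.test") \
    .option("user", username) \
    .option("password", password) \
    .save()
```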
Here are some supported JDBC Spark connector versions and the Spark versions they go with (a quick way to check your cluster's Spark version is sketched after the list):

- Spark 2.4.x: com.microsoft.azure:spark-mssql-connector:1.0.2
- Spark 3.0.x: com.microsoft.azure:spark-mssql-connector_2.12:1.1.0
- Spark 3.1.x: com.microsoft.azure:spark-mssql-connector_2.12:1.2.0
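To pick the matching coordinate, you can print the cluster's Spark version from a Databricks notebook, where the `spark` session is pre-created:

```python
# Print the cluster's Spark version so it can be matched against the
# connector coordinates listed above (e.g. 3.0.x -> spark-mssql-connector_2.12:1.1.0).
print(spark.version)
```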
Resolution:

If you are using a Databricks runtime version greater than 10, you need to downgrade it to Databricks runtime version 9.1 LTS or lower and use the corresponding connector.
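The downgrade can be done from the cluster UI as in the steps below; purely as an illustrative sketch, the same change can also be scripted against the Databricks Clusters REST API. The workspace URL, token, cluster ID and node type here are placeholders, not values from the question:

```python
# Illustrative sketch only: edit an existing cluster's runtime via the Clusters API.
# clusters/edit expects the full desired cluster spec, not just the changed fields.
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/edit",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_id": "<cluster-id>",
        "spark_version": "9.1.x-scala2.12",    # runtime key for 9.1 LTS
        "node_type_id": "<existing-node-type>",
        "num_workers": 2,
    },
)
resp.raise_for_status()
```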
- I downgraded the Databricks runtime version to 7.3 LTS:

![enter image description here](https://i.imgur.com/oFYCaAo.png)

- Then installed the appropriate library for Spark 3.0.x, which is `com.microsoft.azure:spark-mssql-connector_2.12:1.1.0`:

![enter image description here](https://i.imgur.com/cGfwV4Q.png)

- Then tried your code, and it works fine.
```python
df_name.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
    .option("truncate", "false")\
    .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
    .option("databaseName", "databasename")\
    .option("user", "username")\
    .option("password", "password")\
    .option("encrypt", "true")\
    .option("hostNameInCertificate", "*.database.windows.net")\
    .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
    .option("bulkCopyTimeout", "6000000")\
    .save()
```
My execution:

Output:

Alternatively, you can use the generic JDBC data source directly with `df_name.write.format("jdbc")`.

Sample code:
df_name.write.format("jdbc").mode("overwrite")\
.option("truncate", "false")\
.option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
.option("databaseName", "databasename")\
.option("user", "username") \
.option("password", "password") \
.option("encrypt", "true")\
.option("hostNameInCertificate", "*.database.windows.net")\
.option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
.option("bulkCopyTimeout", "6000000")\
.save()
You can also refer to a similar issue on GitHub or a similar Stack Overflow thread.

Reference: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16