Update table from PySpark using JDBC


Question


"I have a small log dataframe which has metadata regarding the ETL performed within a given notebook, the notebook is part of a bigger ETL pipeline managed in Azure DataFactory.

Unfortunately, it seems that Databricks cannot invoke stored procedures so I'm manually appending a row with the correct data to my log table.

however, I cannot figure out the correct sytnax to update a table given a set of conditions :

the statement I use to append a single row is as follows :

spark_log.write.jdbc(sql_url, 'internal.Job', mode='append')

this works swimmingly however, as my Data Factory is invoking a stored procedure,

I need to work in a query like

query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?

Looking at the documentation, the query parameter only seems to be mentioned for select statements:

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

The target database is an Azure SQL Database.

Just to add, this is a tiny operation, so performance is a non-issue.


Answer 1

Score: 3


You can't do single-record updates using JDBC in Spark with dataframes. You can only append or replace the entire table.

You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (https://stackoverflow.com/questions/54132249/how-to-install-pyodbc-in-databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
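
For the pyodbc route, here is a minimal sketch of what running that UPDATE from a Databricks notebook could look like. The server, database, credential values and the ODBC driver name are placeholders, and it assumes the MSSQL ODBC driver from the linked post is already installed on the cluster:

import pyodbc
from datetime import date

# Placeholder connection string -- substitute your own server, database and credentials.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;"
    "DATABASE=your-database;"
    "UID=your-user;"
    "PWD=your-password"
)

# Value to record; in the notebook this would come from the ETL run itself.
max_ingestion_date = date.today()

update_sql = """
UPDATE [internal].[Job] SET
     [MaxIngestionDate] = ?
,    [DataLakeMetadataRaw] = NULL
,    [DataLakeMetadataCurated] = NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL
"""

conn = pyodbc.connect(conn_str)
try:
    cursor = conn.cursor()
    # Bind the date as a query parameter rather than interpolating it into the SQL string.
    cursor.execute(update_sql, max_ingestion_date)
    conn.commit()
finally:
    conn.close()

JayDeBeApi works along the same lines, except the connection is opened with jaydebeapi.connect() against the SQL Server JDBC driver jar instead of an ODBC driver.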

huangapple
  • Posted on 2020-01-06 20:42:38
  • Please keep this link when reposting: https://go.coder-hub.com/59612337.html