Whitespace in column name in Synapse
Question
I am using Databricks to read data from an Azure Synapse table into a Spark DataFrame.
df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url", sql_dw_connection_string) \
.option("tempDir", temp_dir_url) \
.option("enableServicePrincipalAuth", "true") \
.option("query", query) \
.load()
df.cache()
When I try to cache the DataFrame, I get the error below in Databricks.
com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector. Make sure column names do not include any invalid characters such as ';' or whitespace
Underlying SQLException(s):
- com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException, Message: Error occurred while accessing HDFS external file[/sqldw-staging/2023-03-07/07-14-49-790/f030341a-12d7-4ae1-b5e0-81b510a9a9e1/QID68547779_20230307_71451_0.parq.snappy][0]: Java exception raised on call to HdfsBridge_CreateRecordWriter. Java exception message:
HdfsBridge::createRecordWriter - Unexpected error encountered when creating a record writer: IllegalArgumentException: field ended by ';': expected ';' but got 'CODE' at line 9: optional binary ATC CODE [ErrorCode = 110802] [SQLState = S0001]
The reason is that the Azure Synapse table has a column named ATC CODE, and that column name contains whitespace. I have seen solutions on the internet that replace the whitespace with an underscore, but per the project requirements I am not supposed to replace the whitespace with any other character.
Is there any way to resolve this?
Answer 1
Score: 3
As per the Microsoft documentation, invalid symbols such as whitespace or semicolons are now returned as an error by the Azure Synapse connector when they are found in a column name.
I also got a similar error for the column names First Name and Last Name.
- Since you are not allowed to change the column names in the database, you can work around the issue by writing a query that selects the columns containing spaces under underscore aliases. Alternatively, you can create a view that does this and select from that view in your query.
select Id, [First Name] as First_name, [Last Name] as Last_name from demotb
Then the result set will have column names with underscores in place of the spaces.
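For example, the aliased query can be passed straight to the Databricks Synapse connector in place of the original one. This is only a minimal sketch: it assumes the same spark session and the sql_dw_connection_string and temp_dir_url variables from the question, the demotb example table above, and a hypothetical aliased_query variable.

# Sketch: alias the whitespace columns inside the query itself, so the
# connector never stages a Parquet field whose name contains a space.
aliased_query = "select Id, [First Name] as First_name, [Last Name] as Last_name from demotb"

df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", sql_dw_connection_string) \
    .option("tempDir", temp_dir_url) \
    .option("enableServicePrincipalAuth", "true") \
    .option("query", aliased_query) \
    .load()

df.cache()  # should no longer fail, since no returned column name contains whitespace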
- Then, back in Spark, you can replace the _ with a space:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace('_', ' ')) for col in df.columns])
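One thing to watch with a blanket col.replace('_', ' ') is that it also turns any legitimate underscores in other column names into spaces. If that is a concern, a rough alternative under the same assumptions is to rename only the aliased columns through an explicit mapping; the alias_to_original dictionary below is hypothetical and would need to match the aliases used in your own query.

from pyspark.sql import functions as F

# Hypothetical mapping from the underscore aliases used in the query
# back to the original column names in Synapse.
alias_to_original = {
    "First_name": "First Name",
    "Last_name": "Last Name",
}

renamed_df = df.select([
    F.col(c).alias(alias_to_original.get(c, c))  # unmapped columns keep their names
    for c in df.columns
])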
Comments