Whitespace in column name in Synapse

Question

I am using Databricks to read data from an Azure Synapse table into a Spark DataFrame.

df = spark.read \
          .format("com.databricks.spark.sqldw") \
          .option("url", sql_dw_connection_string) \
          .option("tempDir", temp_dir_url) \
          .option("enableServicePrincipalAuth", "true") \
          .option("query", query) \
          .load()
df.cache()

When I try to cache the DataFrame, I get the following error in Databricks:

com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector. Make sure column names do not include any invalid characters such as ';' or whitespace
Underlying SQLException(s):
  - com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException, Message: Error occurred while accessing HDFS external file[/sqldw-staging/2023-03-07/07-14-49-790/f030341a-12d7-4ae1-b5e0-81b510a9a9e1/QID68547779_20230307_71451_0.parq.snappy][0]: Java exception raised on call to HdfsBridge_CreateRecordWriter. Java exception message:
HdfsBridge::createRecordWriter - Unexpected error encountered when creating a record writer: IllegalArgumentException: field ended by ';': expected ';' but got 'CODE' at line 9:   optional binary ATC CODE [ErrorCode = 110802] [SQLState = S0001]

The cause is that the Azure Synapse table has a column named ATC CODE, whose name contains whitespace. I have seen solutions online that replace the whitespace with an underscore, but per the project requirements I am not supposed to replace the whitespace with any other character.

Is there any way to resolve this?

Answer 1

Score: 3

According to the Microsoft documentation, the Azure Synapse connector now returns an error when a column name contains invalid characters such as whitespace or semicolons.

I got a similar error for columns named First Name and Last Name.

  • Since you are not allowed to change the column names in the database, you can work around the issue by writing a query that selects each column whose name contains spaces under an alias that uses underscores instead. You can also create a view that does this and select from that view in the query.
select Id, [First Name] as First_Name, [Last Name] as Last_Name from demotb

The DataFrame will then have column names with underscores in place of the spaces.
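If there are many columns, the aliased query can be generated instead of written by hand. The following is only a sketch: the helper names bracket_quote and build_aliased_query are made up for illustration, the table name is inserted unquoted, and only spaces are replaced. It also returns the alias-to-original mapping, which is useful later for restoring the names.

```python
from __future__ import annotations


def bracket_quote(name: str) -> str:
    """Quote an identifier T-SQL style, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def build_aliased_query(table: str, columns: list[str]) -> tuple[str, dict[str, str]]:
    """Return (query, alias_to_original) for the given source columns.

    Hypothetical helper: each column is selected under an alias in which
    spaces are replaced by underscores. The table name is used as given.
    """
    alias_to_original: dict[str, str] = {}
    select_items = []
    for original in columns:
        alias = original.replace(" ", "_")
        if alias in alias_to_original:
            # e.g. "A B" and "A_B" would both map to "A_B"
            raise ValueError(f"alias collision on {alias!r}")
        alias_to_original[alias] = original
        select_items.append(f"{bracket_quote(original)} AS {bracket_quote(alias)}")
    query = f"SELECT {', '.join(select_items)} FROM {table}"
    return query, alias_to_original


query, alias_to_original = build_aliased_query("demotb", ["Id", "First Name", "Last Name"])
print(query)
# SELECT [Id] AS [Id], [First Name] AS [First_Name], [Last Name] AS [Last_Name] FROM demotb
```

The resulting string can be passed as the "query" option in the read shown in the question.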

  • After loading, you can replace the underscores with spaces again on the Spark side:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace('_', ' ')) for col in df.columns])
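Note that the one-liner above replaces every underscore, so a column whose original name legitimately contains an underscore would be renamed too. A possible variation, sketched here under the assumption that you keep an explicit alias-to-original mapping, renames only the columns that were actually aliased. The function name restore_column_names and the created_at column are hypothetical.

```python
from __future__ import annotations


def restore_column_names(current: list[str], alias_to_original: dict[str, str]) -> list[str]:
    """Map each current column name back to its original name.

    Names that are not in the mapping are passed through unchanged.
    """
    return [alias_to_original.get(name, name) for name in current]


# Assumed mapping, matching the aliases used in the query.
alias_to_original = {"First_Name": "First Name", "Last_Name": "Last Name"}

new_names = restore_column_names(
    ["Id", "First_Name", "Last_Name", "created_at"], alias_to_original
)
print(new_names)
# ['Id', 'First Name', 'Last Name', 'created_at']

# With a Spark DataFrame `df` loaded from the aliased query, the rename would be:
# renamed_df = df.toDF(*restore_column_names(df.columns, alias_to_original))
```

DataFrame.toDF takes the new column names positionally, so the list has to be in the same order as df.columns, which this function preserves.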

huangapple
  • Posted on March 7, 2023 at 15:33:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75659098.html