Whitespace in column name in Synapse
Question
I am using Databricks to read data from an Azure Synapse table into a Spark DataFrame.
df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url", sql_dw_connection_string) \
.option("tempDir", temp_dir_url) \
.option("enableServicePrincipalAuth", "true") \
.option("query", query) \
.load()
df.cache()
When I try to cache the DataFrame, I get the error below in Databricks.
com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector. Make sure column names do not include any invalid characters such as ';' or whitespace
Underlying SQLException(s):
- com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException, Message: Error occurred while accessing HDFS external file[/sqldw-staging/2023-03-07/07-14-49-790/f030341a-12d7-4ae1-b5e0-81b510a9a9e1/QID68547779_20230307_71451_0.parq.snappy][0]: Java exception raised on call to HdfsBridge_CreateRecordWriter. Java exception message:
HdfsBridge::createRecordWriter - Unexpected error encountered when creating a record writer: IllegalArgumentException: field ended by ';': expected ';' but got 'CODE' at line 9: optional binary ATC CODE [ErrorCode = 110802] [SQLState = S0001]
The reason is that the Azure Synapse table has a column named ATC CODE, and that column name contains whitespace. I have seen solutions on the internet that replace the whitespace with an underscore, but per the project requirements I am not supposed to replace the whitespace with any other character.
Is there any way to resolve this?
Answer 1
Score: 3
As per the Microsoft documentation, invalid symbols such as whitespace or semicolons are now returned as an error by the Azure Synapse connector when they are found in a column name.
I also got a similar error for the column names First Name and Last Name.
- Since you are not allowed to change the column names in the database, you can work around the issue by writing a query that selects the columns containing spaces under underscore aliases. Alternatively, you can create a view that does this and select from that view in your query.
select Id, [First Name] as First_name, [Last Name] as Last_name from demotb
Then the result set will have column names with underscores in place of the spaces.
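For example, the aliased query can be passed straight to the Databricks Synapse connector in place of the original one. This is only a minimal sketch: it assumes the same spark session and the sql_dw_connection_string and temp_dir_url variables from the question, the demotb example table above, and a hypothetical aliased_query variable.

# Sketch: alias the whitespace columns inside the query itself, so the
# connector never stages a Parquet field whose name contains a space.
aliased_query = "select Id, [First Name] as First_name, [Last Name] as Last_name from demotb"

df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", sql_dw_connection_string) \
    .option("tempDir", temp_dir_url) \
    .option("enableServicePrincipalAuth", "true") \
    .option("query", aliased_query) \
    .load()

df.cache()  # should no longer fail, since no returned column name contains whitespace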
- Then, back in Spark, you can replace the _ with a space:
from pyspark.sql import functions as F
renamed_df = df.select([F.col(col).alias(col.replace('_', ' ')) for col in df.columns])
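One thing to watch with a blanket col.replace('_', ' ') is that it also turns any legitimate underscores in other column names into spaces. If that is a concern, a rough alternative under the same assumptions is to rename only the aliased columns through an explicit mapping; the alias_to_original dictionary below is hypothetical and would need to match the aliases used in your own query.

from pyspark.sql import functions as F

# Hypothetical mapping from the underscore aliases used in the query
# back to the original column names in Synapse.
alias_to_original = {
    "First_name": "First Name",
    "Last_name": "Last Name",
}

renamed_df = df.select([
    F.col(c).alias(alias_to_original.get(c, c))  # unmapped columns keep their names
    for c in df.columns
])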
Comments