英文:
Table empty in Azure Synapse lake database but parquet files are present in the data lake
问题
我一直在尝试在Azure Synapse中使用CETAS方法从专用SQL池中导出数据到数据湖,然后使用湖数据库来映射/读取创建的parquet文件。
问题出现在当我尝试查询在湖数据库中创建的表时,它只返回没有结果,并显示错误消息'no dataset found for myexampletable'。
我已经检查了文件是否位于数据湖的正确位置,与湖数据库中指定的表的位置相匹配。当我从专用SQL池中导出数据时,使用的压缩格式是'snappy',并且我还在湖数据库表设置中设置了压缩为'snappy'。请注意,我尝试在SQL池和湖数据库中都使用未压缩的parquet格式,但仍然出现相同的错误...
我唯一注意到的是,数据湖中的文件带有扩展名'????.parq.snappy'或'????.parq',而不是通常的'????.parquet',例如。不确定这是否只是专用SQL池命名文件的方式?
我尝试将它们重命名为'????.parquet',这解决了问题。问题是如何让湖库读取'.parq'或'.parq.snappy'文件?或者是否有办法让Synapse专用SQL池使用'正常'文件扩展名导出?
有什么想法如何解决这个问题?
编辑:添加了我使用的示例代码(名称已更改)
CREATE EXTERNAL DATA SOURCE [SomeExternalDataSource] WITH (
TYPE = HADOOP
,LOCATION = N'abfss://somecontainer@somestorageaccount.dfs.core.windows.net/SomeFolder'
)
CREATE EXTERNAL FILE FORMAT [CompressedParquetFormat] WITH (
FORMAT_TYPE = PARQUET
,DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
)
CREATE EXTERNAL TABLE [dbo].[SomeExternalTable] WITH (
LOCATION = 'FolderOne/FolderTwo/Etc'
,DATA_SOURCE = [SomeExternalDataSource]
,FILE_FORMAT = [CompressedParquetFormat]
) AS
SELECT
[Field1]
,[Field2]
,[and_so_on]
FROM
[SomeSchema].[SomeTable]
英文:
I've been playing around with the lake databases in Azure Synapse and have been trying to export data from my dedicated SQL pool using CETAS method to the datalake then using a lake database to map/read the parquet files that were created.
The problem comes when I try to query the table created in the lake database, it just returns no results, with an error listed 'no dataset found for myexampletable'.
I've checked and the files are in the correct location in the datalake which match the location of the table specified in the lake database. The compression used is 'snappy' when I'm exporting from the dedicated SQL pool and I've also set the compression to 'snappy' in the lake database table settings. Note that I've tried using uncompressed parquet format in both the SQL pool and the lake database, same error...
The only thing I've noticed is that the files in the datalake are created with an extension '????.parq.snappy' or '????.parq' as opposed to the usual '????.parquet' for example. Not sure if this just the dedicated SQL pool's way of naming the files?
I tried renaming them to '????.parquet' and this fixes the problem. The question is how do we get the lakehouse to read '.parq' or '.parq.snappy' files? Or is there a way to get Synapse dedicated SQL pool to export using the 'normal' file extension?
Any ideas how to fix this?
Edit: added example code I'm using (names changed of course)
CREATE EXTERNAL DATA SOURCE [SomeExternalDataSource] WITH (
TYPE = HADOOP
,LOCATION = N'abfss://somecontainer@somestorageaccount.dfs.core.windows.net/SomeFolder'
)
CREATE EXTERNAL FILE FORMAT [CompressedParquetFormat] WITH (
FORMAT_TYPE = PARQUET
,DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
)
CREATE EXTERNAL TABLE [dbo].[SomeExternalTable] WITH (
LOCATION = 'FolderOne/FolderTwo/Etc'
,DATA_SOURCE = [SomeExternalDataSource]
,FILE_FORMAT = [CompressedParquetFormat]
) AS
SELECT
[Field1]
,[Field2]
,[and_so_on]
FROM
[SomeSchema].[SomeTable]
答案1
得分: 0
The '.parq'或'.parq.snappy'扩展名不是Parquet文件的标准扩展名,这可能会导致在尝试查询湖数据库中的表时出现问题。
将SQL查询重写为以下语法:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'CompressedParquetFormat')
CREATE EXTERNAL FILE FORMAT [CompressedParquetFormat]
WITH ( FORMAT_TYPE = PARQUET)
GO
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SomeExternalDataSource')
CREATE EXTERNAL DATA SOURCE [SomeExternalDataSource]
WITH (
LOCATION = 'abfss://pool@vamblob.dfs.core.windows.net'
)
GO
CREATE EXTERNAL TABLE dbo.dem12 (
[registration_dttm] datetime2(7),
[id] int,
[first_name] nvarchar(4000),
[last_name] nvarchar(4000),
[email] nvarchar(4000),
[gender] nvarchar(4000),
[ip_address] nvarchar(4000),
[cc] nvarchar(4000),
[country] nvarchar(4000),
[birthdate] nvarchar(4000),
[salary] float,
[title] nvarchar(4000),
[comments] nvarchar(4000)
)
WITH (
LOCATION = 'userdata3.parquet',
DATA_SOURCE = [SomeExternalDataSource],
FILE_FORMAT = [CompressedParquetFormat]
)
GO
select * from dbo.dem12
或者
另一种解决此问题的方法是使用Azure Synapse管道,将Parquet文件复制到数据湖中的另一个位置,并使用正确的文件扩展名'.parquet',然后创建湖数据库并从新位置读取文件。
参考链接:
CREATE EXTERNAL TABLE (Transact-SQL) - SQL Server | Microsoft Learn
英文:
The '.parq' or '.parq.snappy' extensions are not the standard extensions for Parquet files, which may be causing issues when trying to query the table in the lake database.
Reframe SQL query as per below syntax:
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'CompressedParquetFormat')
CREATE EXTERNAL FILE FORMAT [CompressedParquetFormat]
WITH ( FORMAT_TYPE = PARQUET)
GO
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SomeExternalDataSource')
CREATE EXTERNAL DATA SOURCE [SomeExternalDataSource]
WITH (
LOCATION = 'abfss://pool@vamblob.dfs.core.windows.net'
)
GO
CREATE EXTERNAL TABLE dbo.dem12 (
[registration_dttm] datetime2(7),
[id] int,
[first_name] nvarchar(4000),
[last_name] nvarchar(4000),
[email] nvarchar(4000),
[gender] nvarchar(4000),
[ip_address] nvarchar(4000),
[cc] nvarchar(4000),
[country] nvarchar(4000),
[birthdate] nvarchar(4000),
[salary] float,
[title] nvarchar(4000),
[comments] nvarchar(4000)
)
WITH (
LOCATION = 'userdata3.parquet',
DATA_SOURCE = [SomeExternalDataSource],
FILE_FORMAT = [CompressedParquetFormat]
)
GO
select * from dbo.dem12
Or
Another way you can resolve this issue, Using Azure synapse pipeline, copy the parquet files to another location in the data lake with the correct file extension .parquet,
and then create a lake database and read the files from the new location.
Reference:
CREATE EXTERNAL TABLE (Transact-SQL) - SQL Server | Microsoft Learn
答案2
得分: 0
以下是翻译好的部分:
问题是专用 SQL 池导出 Parquet 文件时使用了 '.parq' 文件扩展名,而不是更常见/标准的 '.parquet' 扩展名。数据湖数据库当然在寻找 '.parquet' 文件(而不是 '.parq' 文件),因此尽管文件存在,但没有找到任何文件,因此表是空的。
简而言之,数据湖数据库和无服务器池是建立在使用 '.parquet' 文件扩展名的基础上的,但专用 SQL 池会写入 '.parq' 文件。为什么在同一 '产品' 中要求文件扩展名一致呢,哈哈!?在这里插入一下脸掌声...
我已向 Microsoft 报告了这个问题,他们已确认无法更改文件扩展名为正常/标准的 '.parquet' 扩展名,您必须使用它生成的文件名和扩展名。我已经询问他们是否会在未来更新(或修复;-) 有关此 '问题' 的 SQL 专用池,但我仍在等待回复...
作为一种解决方法,我仍然从专用 SQL 池中导出,然后必须稍后更改文件扩展名(即使用 API 重命名文件)。一旦完成,它们将被数据湖数据库拾取... 哎!
英文:
Just in case anyone else runs into this problem...
The problem was the fact that the dedicated SQL pool was exporting the parquet files using the '.parq' file extension instead of the more common/standard '.parquet' extension. The lake database was of course looking for '.parquet' files (not '.parq' files), so even though the files were there it was not finding any, hence empty tables.
In short, the lake databases and serverless pool are built to use '.parquet' file extensions, but the dedicated SQL pool writes '.parq' files. Why on Earth would anyone want file extensions to be consistent in the same 'product', lol!?! Insert face palm here...
I've reported this to Microsoft and they've confirmed you cannot change the file extension to the normal/standard '.parquet' extension, you're stuck with the file names and extension it generates. I've asked if they will update (or fix SQL dedicated pools regarding this 'issue' in the future, but I'm still awaiting reply...
As a workaround I'm still exporting from dedicated SQL pool but then I have to change the file extensions afterwards (i.e. renaming files using the API). Once that's done they get picked up by the lake database... doh!
Here's the link to my post reporting this to Microsoft and their replies
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论