Can Azure Data Factory read data from Delta Lake format?
Question
We were able to read the files by specifying the Delta file source as a Parquet dataset in ADF. Although this reads the Delta files, it ends up reading all versions/snapshots of the data instead of picking up only the most recent version of the Delta table.
There is a similar question here - https://stackoverflow.com/questions/57917908/is-it-possible-to-connect-to-databricks-deltalake-tables-from-adf?noredirect=1&lq=1
However, I am looking to read the delta file from an ADLS Gen2 location. Appreciate any guidance on this.
Answer 1
Score: 5
I don't think you can do it as easily as reading from Parquet files today, because Delta Lake files are basically transaction log files + snapshots in Parquet format. Unless you VACUUM every time before you read from a Delta Lake directory, you are going to end up reading the snapshot data, as you have observed.
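To illustrate why a plain Parquet read picks up old versions: the JSON commits under `_delta_log` record which data files were added and removed, and only the files still "added" after replaying the log belong to the latest version. Below is a minimal sketch of that replay in plain Python. The function name `active_files` is made up for illustration, and it assumes the full JSON commit history is still on disk (it ignores Parquet checkpoint files), so treat it as an explanation of the idea rather than a replacement for a real Delta reader.

```python
import json
from pathlib import Path


def active_files(table_dir):
    """Return the data files referenced by the latest version of a Delta table.

    Hypothetical helper: replays the JSON commits in _delta_log in version
    order. Checkpoint files are ignored, so this only works while the whole
    JSON history is still present.
    """
    log_dir = Path(table_dir) / "_delta_log"
    live = set()
    # Commit files are named with zero-padded version numbers,
    # so sorting by name is the same as sorting by version.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one action per line
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)
```

Any Parquet file in the directory that is not in this list belongs to an older version, which is what a Parquet dataset pointed at the folder ends up reading as well.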
Delta Lake files do not play very nicely OUTSIDE OF Databricks.
In our data pipeline, we usually have a Databricks notebook that exports data from Delta Lake format to regular Parquet format in a temporary location. We let ADF read the Parquet files and do the clean up once done. Depending on the size of your data and how you use it, this may or may not be an option for you.
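A rough sketch of what such an export notebook cell could look like in PySpark. The function name and both paths are placeholders, not anything from our actual pipeline, and it assumes it runs somewhere the Delta reader is available (for example a Databricks notebook, where `spark` is already defined).

```python
def export_delta_snapshot(spark, delta_path, parquet_path):
    """Write the current version of a Delta table out as plain Parquet.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    # The Delta reader resolves the transaction log, so only the files
    # that belong to the latest version are read.
    df = spark.read.format("delta").load(delta_path)
    # Overwrite so a rerun replaces the previous export instead of appending.
    df.write.mode("overwrite").parquet(parquet_path)
    return parquet_path


# Example call in a notebook (container, account and folders are placeholders):
# export_delta_snapshot(
#     spark,
#     "abfss://<container>@<account>.dfs.core.windows.net/delta/my_table",
#     "abfss://<container>@<account>.dfs.core.windows.net/tmp/my_table_parquet",
# )
```

ADF then reads the temporary Parquet folder with a normal Parquet dataset, and a later activity deletes the folder.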
Answer 2
Score: 2
Time has passed and now ADF Delta support for Data Flow is in preview... hopefully it makes it into ADF native soon.
https://learn.microsoft.com/en-us/azure/data-factory/format-delta