Delta File Format
Question
Delta file format:
Nowadays, different file formats for data processing are becoming popular.
One of them is the Delta format, developed and open-sourced by Databricks.
Its most important feature is ACID (others being support for upsert/delete and data governance).
My question is:
Does this file system act like a server, i.e. a process that keeps running and responds to requests?
[Analogous to the Hadoop file system, where the base file system is the Unix file system and HDFS operates on top of it, with the name node managing the HDFS files and responding to file system requests.]
But if this were true (i.e. the Delta format had processes running), I haven't seen any server process described for the Delta file format in any article.
So, what is responsible for its features (ACID, upsert, delete, data governance, etc.)?
In addition, it is said that many other tools can interact with the Delta file format.
For example, DBT (a SQL-based transformation tool) can read/write data.
If that is the case, which process is responsible for providing the aforementioned features?
It is also mentioned that the Delta format supports only tables. If so, is it an RDBMS product?
I am just trying to understand at which level this file format operates.
For HDFS, it is very clear that it operates on top of the host OS file system, and different processes (name node, data node, etc.) are available to interact with it.
I don't have the same clarity about the Delta format.
Any help will be much appreciated.
Thanks
Answer 1
Score: 1
Delta Lake by itself is just a file format that allows many features to be built on top of it. The data is stored in some storage (cloud or on-premises). It still requires some process to accept data-processing commands and execute them. That can be done in different ways:

- Apache Spark and tools built on top of it, such as Databricks SQL Warehouse. This was the initial use case for Delta, and it remains the most popular one as of right now. Often, third-party tools integrate with Apache Spark via ODBC/JDBC.

- Specialized connectors, e.g. for Trino, PrestoDB, etc., allow those engines to work with Delta Lake tables.

- Rust and Python APIs allow you to work with Delta tables without Apache Spark.
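To make the "no server" point concrete: Delta gets its ACID guarantees from an ordered transaction log (the `_delta_log/` directory of numbered JSON commit files) that every reading/writing library replays, rather than from a running daemon. Below is a toy sketch of that idea, not the real Delta protocol: the file names and action fields are invented for illustration, but the core mechanism (each commit is a numbered file created atomically, so two writers cannot both claim the same version, and readers rebuild table state by replaying the log in order) is the same.

```python
# Toy sketch of a Delta-style transaction log. NOT the real protocol;
# action fields ("add"/"remove") and file names are simplified.
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Write one commit file. Mode "x" fails if the file already exists,
    i.e. a concurrent writer claimed this version first and we would
    have to re-read the log and retry - optimistic concurrency."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def snapshot(log_dir):
    """Rebuild the current set of data files by replaying all commits
    in version order (zero-padded names sort correctly)."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0.parquet"}])
commit(log_dir, 1, [{"add": "part-1.parquet"}])
# A delete/upsert is just a commit that removes old files and adds new ones:
commit(log_dir, 2, [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])
print(sorted(snapshot(log_dir)))  # → ['part-1.parquet', 'part-2.parquet']
```

This is why any engine (Spark, Trino, DBT via an engine, or the Rust/Python libraries) can provide the same guarantees: each one simply implements the same log protocol against ordinary files in object or file storage.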