Is it feasible to use Delta Lake without Databricks?


Question

  1. We have our data lake in AWS S3.
  2. Metadata is in Hive; we have a small running cluster (we haven't used Athena/Glue).
  3. We use Spark and Presto in our Airflow pipelines.
  4. The processed data gets loaded into Snowflake.
  5. The data lake holds various formats, but mostly Parquet.

We want to experiment with Databricks. Our plan is to:

  1. Create Delta Lake tables instead of Hive tables for the entire data lake.
  2. Use Databricks for processing and warehousing a significant part of the data.
  3. We cannot replace Snowflake with Databricks, at least at this moment.
  4. So we need the Delta Lake tables to be usable by other Spark pipelines as well (a sketch of this is shown below).

Is this last step possible without challenges, or is it tricky?

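For reference, here is a minimal sketch of that last step: reading one of the Delta Lake tables from a plain open-source Spark pipeline, outside Databricks. It assumes the delta-spark PyPI package is installed; the app name and S3 path are illustrative.

    # Minimal sketch: consuming a Delta table from OSS Spark (no Databricks).
    # Assumes `pip install delta-spark`; the S3 path below is illustrative.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("read-delta-outside-databricks")
        # Enable Delta Lake's SQL extension and catalog in open-source Spark.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    # Adds the delta-spark JARs that match the installed Python package.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    df = spark.read.format("delta").load("s3a://my-datalake/events")
    df.show()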

Answer 1

Score: 1

It was announced in June 2022 that Delta Lake was open-sourcing all of its features. So from a feature perspective, this should be more than feasible for Delta Lake itself. I've used Delta Lake in production outside of Databricks to good effect; it's a widely supported open-source storage layer.

The concern I see in your list of requirements is concurrent writes to S3 from multiple Spark pipelines. In Databricks there is a managed S3 commit service that locks tables during write operations. This is necessary because S3, unlike some other cloud storage services, doesn't support "put if absent" semantics. Outside of Databricks you'll have to set up your own coordination using DynamoDB, as described in the Delta Lake storage configuration documentation and sketched below.

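A minimal sketch of that setup, assuming Delta Lake 1.2+ with the delta-storage-s3-dynamodb artifact on the classpath; the DynamoDB table name and region are illustrative:

    # Sketch: multi-cluster S3 writes via Delta's DynamoDB-backed LogStore.
    # DynamoDB supplies the atomic "put if absent" that S3 itself lacks.
    # Assumes io.delta:delta-storage-s3-dynamodb is on the classpath;
    # the DynamoDB table name and region are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("concurrent-delta-writer")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Route Delta commits on s3a:// paths through DynamoDB.
        .config("spark.delta.logStore.s3a.impl",
                "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName",
                "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region",
                "us-east-1")
        .getOrCreate()
    )

    # Writers on separate clusters configured the same way can now safely
    # commit to the same table.
    spark.range(10).write.format("delta").mode("append").save(
        "s3a://my-datalake/events")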

Answer 2

Score: 0

As the first answer states, it is feasible. We use HDP on-premises with the Hive Delta connector, and we now use all the services that have become available to everyone, even off the Databricks platform.

We will be moving to GCP with the Delta format (and moving to BigQuery). No issues there.

See https://stackoverflow.com/questions/66933229/writing-to-google-cloud-storage-with-v2-algorithm-safe for further discussion, related to the point made in the second part of the first answer.

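For completeness, a hedged sketch of what writing Delta on GCS can look like, assuming the gcs-connector and Delta's GCSLogStore are available on the classpath; the bucket path is illustrative:

    # Sketch: writing Delta tables on Google Cloud Storage.
    # GCS offers an atomic create-if-not-exists precondition, which the
    # GCSLogStore relies on, so no external lock table is needed there.
    # Assumes the gcs-connector and delta-storage JARs are on the classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-on-gcs")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Use the GCS-specific LogStore for gs:// paths.
        .config("spark.delta.logStore.gs.impl", "io.delta.storage.GCSLogStore")
        .getOrCreate()
    )

    spark.range(10).write.format("delta").save("gs://my-datalake/events")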
