Question

I have a database in my Databricks environment whose storage is mounted to an AWS S3 location. Is there a way to take a snapshot of the database so that I can store it in a different place and restore it in case of a failure?

Answer 1

Score: 1

Databricks is not like a traditional database where all data is stored "inside" the database. For example, Amazon RDS provides a "snapshot" feature that can dump the entire contents of a database, and the snapshot can then be restored to a new database server if required.

The closest equivalent in Databricks is Delta Lake time travel, which lets you query a Delta table as it was at a previous point in time. Data is not "restored" -- rather, it is simply presented as it was at a given timestamp or version. It is a snapshot without the need to actually create a snapshot.
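As a rough illustration of time travel, the sketch below builds the SQL for reading an older state of a Delta table with `VERSION AS OF` or `TIMESTAMP AS OF`. The helper name `time_travel_query` and the table name `my_db.events` are hypothetical; only the SQL clauses come from Delta Lake.

```python
from typing import Optional


def time_travel_query(
    table: str,
    version: Optional[int] = None,
    timestamp: Optional[str] = None,
) -> str:
    """Build a SELECT that reads a Delta table at an earlier version or timestamp.

    Hypothetical helper for illustration; pass exactly one of the two arguments.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
```

In a Databricks notebook, where a `spark` session is already defined, the result could be run with something like `spark.sql(time_travel_query("my_db.events", timestamp="2023-06-01"))`. This only works while the log and data files for that point in time are still retained.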

From Configure data retention for time travel:

> To time travel to a previous version, you must retain both the log and the data files for that version.
>
> The data files backing a Delta table are never deleted automatically; data files are deleted only when you run VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are written.
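
If the default retention window is too short for your needs, one option is to lengthen it through the Delta table properties `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration`. The sketch below only builds the `ALTER TABLE` statement; the helper name `retention_statement`, the table name, and the choice of using the same interval for both properties are assumptions made for illustration.

```python
def retention_statement(table: str, days: int) -> str:
    """Build an ALTER TABLE that keeps Delta log and data files for `days` days.

    Hypothetical helper; it sets both retention properties to the same interval.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    interval = f"interval {days} days"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{interval}', "
        f"'delta.deletedFileRetentionDuration' = '{interval}')"
    )
```

For example, `spark.sql(retention_statement("my_db.events", 90))` would ask Delta to keep about 90 days of history for that table. Longer retention means more storage, and running VACUUM with a shorter retention period can still remove files that time travel would need.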

If, instead, you do want to keep a "snapshot" of the database, a good method would be to create a deep clone of a table, which includes all data. See:

I think you would need to write your own script to loop through each table and perform this operation. It is not as simple as clicking the "Create Snapshot" button in Amazon RDS.
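Such a script might look roughly like the sketch below, which lists the tables in a source database and issues one `DEEP CLONE` statement per table. The function names, the database names, and the optional `location` argument are hypothetical. The sketch assumes every table is a Delta table, that `spark` is the active SparkSession (as in a Databricks notebook), and that views should be skipped.

```python
from typing import Iterable, List, Optional


def deep_clone_statements(
    source_db: str, target_db: str, tables: Iterable[str]
) -> List[str]:
    """Build one CREATE OR REPLACE ... DEEP CLONE statement per table name."""
    return [
        f"CREATE OR REPLACE TABLE {target_db}.{name} DEEP CLONE {source_db}.{name}"
        for name in tables
    ]


def snapshot_database(
    spark, source_db: str, target_db: str, location: Optional[str] = None
) -> List[str]:
    """Deep-clone every table of source_db into target_db; return the table names.

    `location` is an optional storage path for the target database,
    for example a path in a different bucket (assumption for illustration).
    """
    create_db = f"CREATE DATABASE IF NOT EXISTS {target_db}"
    if location is not None:
        create_db += f" LOCATION '{location}'"
    spark.sql(create_db)

    # Skip temporary tables and views; only real tables can be deep-cloned here.
    names = [
        t.name
        for t in spark.catalog.listTables(source_db)
        if not t.isTemporary and t.tableType != "VIEW"
    ]
    for statement in deep_clone_statements(source_db, target_db, names):
        spark.sql(statement)
    return names
```

Calling `snapshot_database(spark, "my_db", "my_db_backup", location="s3://...")` with a path of your choosing would keep the copies apart from the original data. Because the statements use `CREATE OR REPLACE`, running the script again overwrites the previous clones rather than failing.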
