How to pull/retrieve data directly to the external directory from remote instead of creating local copies using DVC?

Question

I'm currently working on a project where I want to push/version data from an external location to a remote storage using DVC, and subsequently pull/retrieve updates from the remote repository back to the external location. In this process, I intend to keep only the .dvc files in the local workspace.

To provide more context, here are the specific requirements:

  • Tracking an External Dataset: I have an external dataset located at /external-to-workspace/dataset, and I want to track this dataset using DVC.
  • Uploading Dataset Versions to DVC Remote: I need to upload different versions of the dataset over SSH to a DVC remote located at ssh://example.com/path/to/storage.
  • Updating the External Dataset from DVC Remote: Instead of creating a local copy of the dataset when pulling/retrieving from the remote, I want to update the external dataset itself. This means that the changes made to the dataset in the DVC remote should be reflected in the original external location.

I tried dvc import-url --to-remote, which copies the data from the external location to the remote, but dvc pull then creates a local copy in the workspace instead of pulling into the external directory again.

I believe DVC provides the necessary functionalities to accomplish this, but I am unsure about the exact steps and configuration required. If anyone has experience with a similar setup or any suggestions on how to achieve this, I would greatly appreciate your guidance.

Answer 1

Score: 1

How about this option: in /external-to-workspace/, initialize a DVC and Git repository of its own. In it you can use all the regular commands (dvc add, dvc push, etc.) to version your data and save it to the remote. The data location stays the same, /external-to-workspace/dataset; you would just have an extra /external-to-workspace/.dvc and /external-to-workspace/.git. To avoid data duplication, make sure that symlinks are enabled (for example, by setting cache.type to symlink in the DVC config).
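
As a rough sketch of that setup (not a tested recipe), the steps could look like the function below. The function name, the remote name "storage", and the commit message are made up for illustration; the directory and SSH URL are the ones from the question. It assumes git and dvc are installed and that Git has a user name and email configured.

```shell
# Sketch only: turn the external directory into its own Git + DVC repository.
# The body is wrapped in ( ... ) so it runs in a subshell: the cd and set -e
# below do not leak into the calling shell.
init_external_repo() (
    set -e
    ext_dir="$1"      # e.g. /external-to-workspace
    remote_url="$2"   # e.g. ssh://example.com/path/to/storage

    cd "$ext_dir"
    git init
    dvc init

    # Link workspace files to the DVC cache instead of copying them.
    dvc config cache.type symlink

    # Register the SSH remote as the default remote ("storage" is a hypothetical name).
    dvc remote add -d storage "$remote_url"

    # Track the dataset in place; this writes dataset.dvc and a .gitignore entry.
    dvc add dataset

    # Record the DVC metadata (not the data itself) in Git.
    git add dataset.dvc .gitignore .dvc/config
    git commit -m "Track dataset with DVC"

    # Upload this version of the dataset to the remote.
    dvc push
)

# Usage (paths and URL taken from the question):
# init_external_repo /external-to-workspace ssh://example.com/path/to/storage
```

Each later version of the dataset would then be another dvc add, git commit, dvc push cycle inside /external-to-workspace.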

In the project repository, you then have two options:

  1. Use the data in the external repo directly, without using DVC in the project repository at all.
  2. Use dvc import, and set up DVC so that it shares its cache with the external repo, so that no data is moved around.
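
For the second option, one possible sketch is shown below. It assumes the project repository is already initialized with git and dvc, that the external repo has committed its dataset.dvc, and that both repos sit on a filesystem where symlinks into the external cache work. The function name and commit message are hypothetical.

```shell
# Sketch only: import the dataset from the external repo while sharing its cache.
# The body runs in a subshell so cd and set -e stay local to the function.
import_from_external_repo() (
    set -e
    project_dir="$1"  # the project repository
    ext_dir="$2"      # e.g. /external-to-workspace

    cd "$project_dir"

    # Point this repo at the external repo's cache so only one copy of the data exists.
    dvc cache dir "$ext_dir/.dvc/cache"
    dvc config cache.type symlink

    # Import the dataset; dataset.dvc records the source repo and its revision.
    dvc import "$ext_dir" dataset

    # Commit only the metadata to the project's Git history.
    git add dataset.dvc .gitignore .dvc/config
    git commit -m "Import dataset from external repo"
)

# Usage:
# import_from_external_repo /path/to/project /external-to-workspace
```

When the external repo later publishes a new version, dvc update dataset.dvc in the project repository should move the import to it. If only the .dvc file is wanted in the project workspace, it may be worth checking dvc import --help for a --no-download option in your DVC version.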

Does it sound reasonable as an approach? (I can give more specific commands on how to try it.)
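
For the "update the external dataset from the remote" requirement in the question, a minimal sketch under the same approach would be to pull inside the external repo itself. This assumes /external-to-workspace is its own Git + DVC repo as suggested, that it has a Git remote from which newer dataset.dvc commits can be pulled, and that its default DVC remote is the SSH storage. The function name is hypothetical.

```shell
# Sketch only: refresh the external dataset in place from the remotes.
# The body runs in a subshell so cd and set -e stay local to the function.
update_external_dataset() (
    set -e
    ext_dir="$1"   # e.g. /external-to-workspace

    cd "$ext_dir"

    # Fetch the latest dataset.dvc (the pointer to the current data version) from Git.
    git pull

    # Download the matching data from the DVC remote and check it out into dataset/.
    dvc pull
)

# Usage:
# update_external_dataset /external-to-workspace
```

Because the pull runs in the external repo, the data lands in /external-to-workspace/dataset rather than in the project workspace.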

Posted by huangapple on July 7, 2023 at 01:01:59
Link to this article: https://go.coder-hub.com/76631074.html