How to pull/retrieve data directly to the external directory from remote instead of creating local copies using DVC?
Question
I'm currently working on a project where I want to push/version data from an external location to remote storage using DVC, and subsequently pull/retrieve updates from the remote back to the external location. In this process, I intend to keep only the .dvc files in the local workspace.
To provide more context, here are the specific requirements:
- Tracking an external dataset: I have an external dataset located at /external-to-workspace/dataset, and I want to track this dataset using DVC.
- Uploading dataset versions to a DVC remote: I need to upload different versions of the dataset over SSH to a DVC remote located at ssh://example.com/path/to/storage.
- Updating the external dataset from the DVC remote: Instead of creating a local copy of the dataset when pulling/retrieving from the remote, I want to update the external dataset itself. This means that changes made to the dataset in the DVC remote should be reflected in the original external location.
I tried dvc import-url --to-remote, which copies data from the external location to the remote, but dvc pull creates local copies instead of pulling into the external directory again.
I believe DVC provides the necessary functionality to accomplish this, but I am unsure about the exact steps and configuration required. If anyone has experience with a similar setup or any suggestions on how to achieve this, I would greatly appreciate your guidance.
Answer 1
Score: 1
How about this option: initialize a separate DVC and Git repository in /external-to-workspace/. In it you can use all the regular commands, such as dvc add, dvc push, etc., to version your data and save it to the remote. The data location stays the same, /external-to-workspace/dataset; you would just have an extra /external-to-workspace/.dvc and /external-to-workspace/.git. To avoid data duplication, make sure that symlinks are enabled.
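A rough sketch of that external-repo setup as a shell script. The paths and the SSH URL come from the question; the remote name storage and the commit message are placeholders of mine, and the SSH remote assumes DVC was installed with SSH support (for example pip install "dvc[ssh]"). The script only prints the commands unless RUN=1 is set, so the plan can be reviewed first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Locations taken from the question; adjust to your setup.
EXTERNAL_DIR="/external-to-workspace"
REMOTE_URL="ssh://example.com/path/to/storage"

# Dry run by default: print each command; set RUN=1 to execute it as well.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

run cd "$EXTERNAL_DIR"
run git init
run dvc init

# "storage" is an assumed name for the default remote.
run dvc remote add -d storage "$REMOTE_URL"

# Link files from the DVC cache instead of copying them, to avoid duplication.
run dvc config cache.type symlink

# Track the dataset in place; this writes dataset.dvc next to it.
run dvc add dataset
run git add dataset.dvc .gitignore .dvc/config
run git commit -m "Track dataset with DVC"

# Upload the current dataset version to the SSH remote.
run dvc push

# Later, to bring the external dataset up to date from the remote
# (assumes the Git repository has its own Git remote configured):
run git pull
run dvc pull
```

With this layout, dvc pull is run inside /external-to-workspace/, so the files land in /external-to-workspace/dataset rather than in the project workspace.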
In the project repository, in turn, you will have two options:
- Use the data in the external repo directly, without using DVC in the project at all.
- Use dvc import (and set up DVC so that it shares the cache with the external repo, so that no data gets moved around).
Does this sound like a reasonable approach? (I can give more specific commands on how to try it.)
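For the second option, the project side might look roughly like the sketch below. It assumes the external repository has already been set up and committed as described, that the project is an existing Git and DVC repository at /path/to/project (a hypothetical path), and that pointing the project's cache at the external repo's default cache directory is acceptable in your environment. As a safety measure, it only prints the commands unless RUN=1 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# The external repo location is from the question; the project path is hypothetical.
EXTERNAL_REPO="/external-to-workspace"
PROJECT_DIR="/path/to/project"

# Dry run by default: print each command; set RUN=1 to execute it as well.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

run cd "$PROJECT_DIR"

# Reuse the external repo's cache so the data is not stored twice.
run dvc cache dir "$EXTERNAL_REPO/.dvc/cache"
run dvc config cache.type symlink

# Import the tracked dataset from the external DVC/Git repository.
run dvc import "$EXTERNAL_REPO" dataset
run git add dataset.dvc .gitignore .dvc/config
run git commit -m "Import dataset from the external repo"

# Later, after a new dataset version is committed in the external repo:
run dvc update dataset.dvc
```

Note that dvc import still places a dataset entry in the project workspace; with a shared cache and symlinks enabled it should consist of links into the cache rather than a second copy of the data.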