How to add existing data via DVC?
Question
I have some data in S3 in an AWS account.
I want to use that data in a new machine learning project.
To be able to use that data and track it via DVC, do I need to download it to my local machine first and then add it with the `dvc add` command?
I understand this will add it to the local cache on my machine, generate a hash, and write it to .dvc files for tracking purposes.
As the data already exists on S3, I wouldn't need to do a `dvc push` after `dvc add`. Is my logic right here?
Answer 1
Score: 1
There are two options if you don't want to download the file locally first.
- If you don't want to `push` your data back to a remote, you can use external inputs. You can do that with `dvc add --external` (see https://dvc.org/doc/user-guide/data-management/managing-external-data). This will work with your remote and won't `push` data back to any remote. You can also check out this question for a usage example: https://stackoverflow.com/questions/67104752/dvc-add-external-s3-mybucket-data-csv-is-failing-with-access-error-even-aft
- If you're OK with `push`ing your artifact back to a remote (it should be a different remote, or a different path in the same remote), you can use `dvc import-url` (see https://dvc.org/doc/command-reference/import-url).
Generally, the latter is preferred because it leaves less room for mistakes. See https://dvc.org/doc/user-guide/data-management/managing-external-data for the motivation behind this recommendation.
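For illustration, the `dvc import-url` route could look roughly like the sketch below. The bucket, object path, output file name, and commit message are made-up placeholders, and the small `run` helper is only there so the snippet prints the commands by default instead of touching a real bucket. It also assumes a DVC project with a default remote already configured for `dvc push`.

```shell
# Hypothetical placeholders: replace with your own bucket and file names.
SRC_URL="s3://my-bucket/raw/data.csv"   # assumption: where the existing data lives
DEST="data.csv"                          # assumption: output path inside the project

# Dry-run helper: prints each command unless DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Import the S3 object; DVC writes data.csv.dvc, which records the source URL and hash.
run dvc import-url "$SRC_URL" "$DEST"

# Version the small .dvc pointer file (not the data itself) in Git.
run git add "$DEST.dvc" .gitignore
run git commit -m "Track existing S3 data via dvc import-url"

# Upload the cached copy to the project's DVC remote
# (a different remote, or a different path, than the source).
run dvc push
```

Run as-is, it only echoes the four commands; setting `DRY_RUN=0` executes them for real. If the goal is to avoid a local copy entirely, the `import-url` reference linked above also documents a `--to-remote` option, which is worth checking against your installed DVC version.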