How to add existing data via DVC?

Question

I have some data in S3 in an AWS account that I want to use in a new machine learning project. To use that data and track it via DVC, do I need to download it to my local machine first and then add it with the dvc add command? I understand this will add it to the local cache on my machine, generate a hash, and write it to .dvc files for tracking purposes. Since the data already exists on S3, I wouldn't need to run dvc push after dvc add.

Is my logic right here?


Answer 1

Score: 1


There are two options if you don't want to download the file locally first.

  1. If you don't want to push your data back to a remote, you can use external inputs.

    You can do that with dvc add --external (see https://dvc.org/doc/user-guide/data-management/managing-external-data). This works directly with your remote and won't push data back to any remote.

    You can also check out this question for an example of using it: https://stackoverflow.com/questions/67104752/dvc-add-external-s3-mybucket-data-csv-is-failing-with-access-error-even-aft

  2. If you're OK with pushing your artifact back to a remote (it should be a different remote, or a different path in the same remote), you can use dvc import-url (see https://dvc.org/doc/command-reference/import-url).

Generally, the latter is preferred because it leaves less room for mistakes. See https://dvc.org/doc/user-guide/data-management/managing-external-data for more on the motivation behind this recommendation.
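To make the two options more concrete, here is a minimal Python sketch that only assembles and prints the corresponding DVC command lines (a dry run by default). The bucket name, paths, and remote name are hypothetical placeholders, and the helper functions are illustrative, not part of DVC. Option 1 assumes a DVC version that still provides the --external flag, and it may also need an external cache configured as described in the linked guide.

```python
import shlex
import subprocess

# Hypothetical locations: replace with your own bucket and paths.
SOURCE_URL = "s3://mybucket/data.csv"      # data that already exists on S3
DVC_REMOTE_URL = "s3://mybucket/dvcstore"  # a different path, used as the DVC remote


def external_add_cmd(url):
    """Option 1: track the S3 object in place, without pushing it anywhere."""
    return ["dvc", "add", "--external", url]


def import_url_cmds(url, out, remote_url, remote_name="storage"):
    """Option 2: import the object into the workspace, then push it to a DVC remote."""
    return [
        ["dvc", "import-url", url, out],
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        ["dvc", "push"],
    ]


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all([external_add_cmd(SOURCE_URL)])
    run_all(import_url_cmds(SOURCE_URL, "data.csv", DVC_REMOTE_URL))
```

Calling run_all(..., dry_run=False) inside an initialized DVC repository would actually execute the commands. After an import, the generated .dvc file is what gets committed to Git.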

huangapple
  • Published on May 26, 2023, 08:27:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76336953.html