Replies: 1 comment
First, regarding the data: the short answer is that it's quite flexible, and you can set up DVC with data in the same repo or in a separate one. Both ways can work. It primarily depends on the team's workflow: whether data processing needs some code from the repo that does modeling, and whether you are fine having commits/PRs related to data changes in the same repo. If there are a lot of datasets, a separate repo can also help keep things organized. If you decide to do a separate repo for data, I would take a look at this approach: https://dvc.org/doc/use-cases/data-registry
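With the data-registry approach, datasets live in their own Git+DVC repo and project repos pull them in with `dvc import`, which records the source and revision so you can refresh later. A minimal sketch (the registry URL and dataset path below are placeholders, not a real repo):

```shell
# In the project repo: import a dataset tracked in a separate
# DVC "data registry" repo (URL and path are hypothetical).
dvc import https://github.com/example/dataset-registry datasets/images

# Later, pull in upstream changes to that dataset:
dvc update images.dvc
```

The `.dvc` file created by the import is committed to the project repo, so the dataset version is pinned alongside the code that uses it.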
Again, it's up to you how to manage this. If your dataset history is non-linear in nature, then branches make sense: create a branch to update a dataset, review it, and merge it back to main. If the dataset history is linear, you can think of it as v1, v2, etc.; then it makes sense to keep it in the main branch and optionally assign Git tags (like v1, v2). Please don't hesitate to follow up with your questions; to some extent we need to understand your workflow better to recommend something specific.
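For the linear-history case, a quick sketch of tagging dataset versions in main and restoring an older one later (file names and tag names here are just examples):

```shell
# Track the updated dataset and tag the new version.
dvc add data/train.csv          # updates data/train.csv.dvc
git add data/train.csv.dvc
git commit -m "Update training data"
git tag dataset-v2

# Restore an older dataset version later:
git checkout dataset-v1
dvc checkout                    # syncs data files to match the .dvc pointers
```

Since Git only versions the small `.dvc` pointer files, switching tags is cheap; `dvc checkout` then pulls the matching data from the cache or remote.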
What are the best practices for using DVC in prod automation and pipelines?
Should we have a separate repo for DVC, rather than the repo we have for multiple models? We're using SageMaker for model building and deployment, so we're looking at DVC for data versioning alone.
Should we create a separate branch for each model's dataset, or should all commits happen in the main/master branch?
In the existing repo, we're trying out CI/CD whenever we push to the main branch. Will having DVC in the same repo also work?