Replies: 1 comment
First, regarding the data: the short answer is that it's quite flexible, and you can set up DVC with data in the same repo or in a separate one. Both ways can work. It primarily depends on the team's workflow: whether data processing needs some code from the repo that does modeling, and whether you are fine having commits/PRs related to data changes in the same repo. If there are a lot of datasets, a separate repo can also help keep things organized. If you decide to do a separate repo for data, I would take a look at this approach: https://dvc.org/doc/use-cases/data-registry
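With the data-registry approach, datasets live in their own Git+DVC repo and project repos pull them in with `dvc import`, which records the source and revision so you can refresh later. A minimal sketch (the registry URL and dataset path below are placeholders, not a real repo):

```shell
# In the project repo: import a dataset tracked in a separate
# DVC "data registry" repo (URL and path are hypothetical).
dvc import https://github.com/example/dataset-registry datasets/images

# Later, pull in upstream changes to that dataset:
dvc update images.dvc
```

The `.dvc` file created by the import is committed to the project repo, so the dataset version is pinned alongside the code that uses it.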
Again, it's up to you how to manage this. If your dataset history is non-linear in nature, then branches make sense: create a branch to update a dataset, review it, and merge it back to main. If the dataset history is linear, you can think of it as v1, v2, etc.; then it makes sense to keep it in the main branch and optionally assign Git tags (like v1, v2). Please don't hesitate to follow up with your questions; to some extent we need to understand your workflow better to recommend something specific.
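For the linear-history case, a quick sketch of tagging dataset versions in main and restoring an older one later (file names and tag names here are just examples):

```shell
# Track the updated dataset and tag the new version.
dvc add data/train.csv          # updates data/train.csv.dvc
git add data/train.csv.dvc
git commit -m "Update training data"
git tag dataset-v2

# Restore an older dataset version later:
git checkout dataset-v1
dvc checkout                    # syncs data files to match the .dvc pointers
```

Since Git only versions the small `.dvc` pointer files, switching tags is cheap; `dvc checkout` then pulls the matching data from the cache or remote.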
What are the best practices for using DVC in prod automation and pipelines?
Should we have a separate repo for DVC, rather than the repo we have for multiple models? We're using SageMaker for model building and deployment, so we're looking at DVC for data versioning alone.
Should we create a separate branch for each model's dataset, or should all commits happen in the main/master branch?
In the existing repo, we're trying out CI/CD whenever we push to the main branch. Will having DVC in the same repo also work?