
DVCX datasets #10114

Closed
Tracked by #9945
skshetry opened this issue Nov 28, 2023 · 16 comments · Fixed by #10287
Assignees
Labels
p1-important Important, aka current backlog of things to do

Comments

@skshetry
Member

skshetry commented Nov 28, 2023

No description provided.

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Dec 1, 2023
@dberenbaum dberenbaum added this to DVC Dec 1, 2023
@dberenbaum dberenbaum moved this to In Progress in DVC Dec 1, 2023
@skshetry
Member Author

skshetry commented Dec 1, 2023

Do streaming datasets affect us in any way? I assume that'll be in the user's code rather than in dvc (and cached by dvcx). Is that just dependency support, without materializing?

@skshetry
Member Author

skshetry commented Dec 2, 2023

Okay, looks like for streaming datasets, we will need some APIs to redirect a tracked dataset in dvc to the appropriate dataset in dql.

@skshetry
Member Author

skshetry commented Dec 8, 2023

Discussed with @dmpetrov about this. He does not want dvc to compute md5 hashes for pipeline checksums, and wants that to be handled by dql itself, as in asking dql for the workspace status.
His point was that a "dvcx dataset" involves a database, support for returning data in different formats (like json pairs/coco), and the actual files, all of which only dql has good knowledge about.

This would require us to decouple checksums that we use for pipelines and md5 hash that we use for data management.

There is also an open question of how dvc pull/dvc checkout, etc. would work. We also need to support non-instantiated/non-materialized datasets, as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to support such a dataset as a stage dependency in pipelines.

This would be a matter of figuring out how the API will work, and how the translation works.

dvc stage add -d ds://dogs@v1 ...

(For this, we'll need to figure out how the hypothetical get_dataset() API in dvc will work and how we'll pass the data inside.)
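As a strawman, get_dataset() could be as small as this, assuming dvc repro passes the pinned version into the stage process through an environment variable (every name here is invented, and the envvar route has the limitation raised below):

import os

def get_dataset(name: str) -> dict:
    # Hypothetical: `dvc repro` would export e.g. DVC_DATASET_DOGS=v1
    # for a stage that declares `-d ds://dogs@v1`.
    version = os.environ.get(f"DVC_DATASET_{name.upper()}")
    return {"name": name, "version": version}

ds_info = get_dataset("dogs")  # e.g. {"name": "dogs", "version": "v1"}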

I think we should do a single integration (or start with just one). Given the challenges with the former (data management via checkout/pull), maybe it's better to start with the latter (pipeline dependencies).

@skshetry
Member Author

skshetry commented Dec 8, 2023

Thinking about another alternative: maybe we should create a top-level `datasets` section in dvc.yaml where users register their datasets/versions, and then they can reference them in dvc stage add?

Eg:

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...

from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Passing data through envvars won't work if you want to let users do python script.py outside of dvc repro/dvc exp run/dvclive.

Also, I am not quite convinced we need a stage-level dependency here, since there are not going to be multiple versions of the same dataset, and it's already registered globally.

(We'll need to know whether it changed or not, in order to rerun stages.)
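To make that concrete, a sketch of the check (assuming dvc.lock would record a datasets list mirroring dvc.yaml; none of this exists yet):

import yaml

def _pinned(path: str) -> dict:
    # Map dataset name -> pinned url, e.g. {"dogs": "ds://dogs@v2"},
    # from a top-level `datasets` list like the one above.
    with open(path) as f:
        entries = yaml.safe_load(f).get("datasets", [])
    return {u.split("://")[1].split("@")[0]: u for u in entries}

def dataset_changed(name: str) -> bool:
    # A stage would rerun when the registered pin differs from the one
    # recorded at the last run (hypothetically kept in dvc.lock).
    return _pinned("dvc.yaml").get(name) != _pinned("dvc.lock").get(name)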

@dberenbaum
Collaborator

dberenbaum commented Dec 8, 2023

Just sent an invite for the 3 of us to discuss next week.

There is also an open question of how dvc pull/dvc checkout, etc. would work. We also need to support non-instantiated/non-materialized datasets, as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to support such a dataset as a stage dependency in pipelines.

So these are the possible approaches I see for accessing DVCX datasets. Does it look correct to you two?

[Diagram: DVC DVCX - Frame 2(1), comparing the possible approaches]

This corresponds to phase 1 and phase 2 in Notion.

@skshetry Are you saying we should focus on approach 2?

@skshetry
Member Author

skshetry commented Dec 8, 2023

In the second diagram, dvc will have to act as an intermediary.

The dataset info needs to be tracked in dvc's metadata, not in the source code, so that when users update the metadata (via Studio or through CLI), dvc will retrigger stages that changed.

At the same time, the dataset that is being referenced in source code should change to the newer dataset, and should be used for training.

Users could update it in both places - in the source code and in the metadata - but making that update as close to one-click as possible (in Studio and in the CLI), so that you can retrigger your pipelines and run your changes, seems to be the feature we want here.

@skshetry
Member Author

skshetry commented Dec 8, 2023

Are you saying we should focus on approach 2?

I am saying that we should start with one integration first.

Regarding the first approach, there is a strict requirement not to touch the dataset, and there are also technical challenges. Given this, I am not sure dvc even needs to be involved?

Can users use dql to materialize it? Can it be just a dependency, or an external dependency from dvc's point of view, making dql the command to manipulate/checkout/pull datasets?

@dberenbaum
Collaborator

Thanks @skshetry. I agree that starting with approach 2 makes sense, just wanted to clarify that's what you meant. I also agree DVC needs awareness of the dataset, but I'm not convinced we need top-level datasets.

Within approach 2, I see two different scenarios:

a. Fixed versions -- I want to specify the exact version every time
b. Latest version -- I always want to use the latest but need to snapshot it for reproducibility

2a looks simple to me. Use the DVCX API directly in your code and parametrize it with DVC:

# params.yaml
dataset:
  name: dogs
  version: 2
# dvc.yaml
stages:
  train:
    cmd: python train.py ${dataset}
ds = DatasetQuery(name=name, version=version)
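Spelling out the 2a code path: train.py could read those values with dvc.api.params_show() and hand them to DVCX directly (a sketch assuming the params.yaml layout above):

# train.py
import dvc.api
from dql.query import DatasetQuery

params = dvc.api.params_show()  # reads params.yaml, incl. the dataset section
ds = DatasetQuery(name=params["dataset"]["name"],
                  version=params["dataset"]["version"])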

2b is where we need DVC to snapshot the version, but only because the user doesn't want to specify it in code, so it can look like this:

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - ds://dogs
# dvc.lock
train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2

ds = DatasetQuery(name="dogs")

update the metadata (via Studio or through CLI)

In 2a, it can be done by updating the parameters.

In 2b, the update will happen automatically.

Thinking about another alternative: maybe we should create a top-level `datasets` section in dvc.yaml where users register their datasets/versions, and then they can reference them in dvc stage add?

I would prefer not to because:

  • It's a more complex UX -- need another abstraction to add the dataset to DVCX first, then need to reference that with special syntax in dependencies and in code, and can't use code outside dvc
  • It's less consistent with how we treat other datasets -- other dependencies don't require any dvc-specific syntax in code and don't need to be defined outside the pipeline first
  • It's more work to implement

@skshetry
Member Author

skshetry commented Dec 8, 2023

Maybe I am over-optimizing here and this is unnecessary, but dependencies can point to multiple versions of the same dataset. That's what I was trying to avoid with this registration.

train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2
    - path: ds://dogs
      version: v1

Secondly, they are stage-local. You could have one stage using one version of a dataset and a separate stage using a different version.

We could easily forbid this though (at least in the DatasetQuery/get_dataset API).
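A sketch of the kind of check that could enforce this when loading dvc.yaml (not current dvc behavior; dep layout as in the snippet above):

from collections import Counter

def check_unique_datasets(deps: list) -> None:
    # Reject a stage whose deps pin the same ds:// dataset more than once.
    paths = [d["path"] for d in deps
             if isinstance(d, dict) and str(d.get("path", "")).startswith("ds://")]
    for path, count in Counter(paths).items():
        if count > 1:
            raise ValueError(f"{path} is pinned {count} times in a single stage")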

Thirdly, you'd have to update it in multiple places rather than just one. But that's a minor issue.

@skshetry
Member Author

skshetry commented Dec 8, 2023

@dberenbaum, how would dvc and Studio understand which dataset is being used in 2a?

@skshetry
Member Author

skshetry commented Dec 8, 2023

Also, do you think 2b) is important? Why don't we push for pinning the dataset for stronger reproducibility?

Users can update dvc.yaml when they want to update (and it could be one-click from Studio if needed).

@dberenbaum
Collaborator

Okay, I think you've convinced me on having top-level datasets as long as we are all aligned that the focus is on approach 2. I have some thoughts on minor modifications, and I do think 2b is important as it's probably the default workflow for a lot of people, but let's take the weekend to think about it and we can follow up on details next week.

@dberenbaum
Collaborator

Adding some additional thoughts. Maybe you already thought of some of these but didn't detail them yet.

dvc.yaml

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...

Maybe we can generalize this more? We might want to reuse datasets in the future for other data sources (databases, cloud-versioned data, data from other repos). It seems like this is going in the direction of #2378, where datasets act like callback dependencies that don't rely on file hashing. WDYT about something like this (or we could keep the ds:// syntax but include dvcx somewhere)?

# dvc.yaml
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: v2
stages:
  train:
    deps:
      - dataset: dogs

Also wondering: how should it look in dvc.lock?

python api

from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Can dataset.get() (or some method) return info about the dataset instead of actually reading the dataset? For example:

from dql.query import DatasetQuery
from dvc.api import dataset

ds = DatasetQuery(**dataset.get("dogs"))

As DVCX adds more methods and ways to stream data like pytorch data loaders, etc., we don't want to have to add more DVC API methods to support every DVCX way to load data. Users can keep their native DVCX code and only use DVC to provide the dataset info.
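For illustration, such a dataset.get() might be little more than a lookup into the registered datasets in dvc.yaml (a sketch assuming the mapping layout proposed above; the real API shape is still open):

import yaml

def get(name: str) -> dict:
    # Look up the pinned dataset info for DatasetQuery(**...) to consume,
    # e.g. {"name": "dogs", "version": "v2"}.
    with open("dvc.yaml") as f:
        spec = yaml.safe_load(f)["datasets"][name]
    return {"name": spec["name"], "version": spec["version"]}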

@dmpetrov
Member

@dberenbaum looks great!

the direction of #2378, where datasets act like callback dependencies that don't rely on file hashing

Yes, it's related - you ask for the update status using an external tool (DVCX, DBT, or custom code). IDK how much generalization is needed at this point.

Also wondering: how should it look in dvc.lock?

I guess a UUID or timestamp of the dataset. This is needed so we don't mix up dogs@v2 with a dogs@v2 that was created and then deleted last week. Later DVCX will introduce a proper fingerprint, but that is a whole separate story.
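Concretely, a dvc.lock entry might then look something like this (illustrative only; the exact fields were not settled):

train:
  deps:
    - path: ds://dogs
      version: v2
      uuid: 6a9f0c52-...   # pins the exact dataset instance, not just the label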

@dberenbaum
Collaborator

I guess a UUID or timestamp of the dataset. This is needed so we don't mix up dogs@v2 with a dogs@v2 that was created and then deleted last week. Later DVCX will introduce a proper fingerprint, but that is a whole separate story.

Related to #982. Is it important when creating and deleting a dataset that you can reuse the same version numbers? If we have a UUID or timestamp, can DVC still get the deleted dataset to reproduce (assuming the raw storage hasn't changed)? What is the point of version numbers if we also plan to have UUID/timestamp?

@dmpetrov
Member

dmpetrov commented Dec 12, 2023

Name and version are human-readable.

It would be the same as when dvc imports a file from a git repository: the file name and tags are nice, but it's better to use a sha for a proper snapshot.

@dberenbaum dberenbaum moved this from In Progress to Review In Progress in DVC Feb 6, 2024
@dberenbaum dberenbaum moved this from Review In Progress to In Progress in DVC Feb 6, 2024
@skshetry skshetry linked a pull request Feb 23, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Feb 26, 2024