
DVCX datasets #10114

Closed
Tracked by #9945
skshetry opened this issue Nov 28, 2023 · 16 comments · Fixed by #10287
Assignees
Labels
p1-important Important, aka current backlog of things to do

Comments

@skshetry
Member

skshetry commented Nov 28, 2023

No description provided.

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Dec 1, 2023
@dberenbaum dberenbaum added this to DVC Dec 1, 2023
@dberenbaum dberenbaum moved this to In Progress in DVC Dec 1, 2023
@skshetry
Member Author

skshetry commented Dec 1, 2023

Do streaming datasets affect us in any way? I assume that'll be in the user's code rather than in dvc (and cached by dvcx). Is that just dependency support, without materializing?

@skshetry
Member Author

skshetry commented Dec 2, 2023

Okay, looks like for streaming datasets, we will need some APIs to redirect a tracked dataset in dvc to the appropriate dataset in dql.

@skshetry
Member Author

skshetry commented Dec 8, 2023

Discussed with @dmpetrov about this. He does not want dvc to compute md5 hashes for pipeline checksums, and wants that to be handled by dql itself, as in asking dql for the workspace status.
His point was that a "dvcx dataset" involves a database, support for returning data in different formats (like json pairs/coco), and the actual files, all of which only dql has good knowledge about.

This would require us to decouple checksums that we use for pipelines and md5 hash that we use for data management.

There is also an open question of how dvc pull/dvc checkout, etc. would work. We also need to support non-instantiated/non-materialized datasets, as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to support such a dataset as a stage dependency in pipelines.

This would be a matter of figuring out how the API will work, and how the translation works.

dvc stage add -d ds://dogs@v1 ...

(For this, we'll need to figure out how the hypothetical get_dataset() API in dvc will work and how we'll pass the data inside.)
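As a strawman, get_dataset() could be as small as this, assuming dvc repro passes the pinned version into the stage process through an environment variable (every name here is invented, and the envvar route has the limitation raised below):

import os

def get_dataset(name: str) -> dict:
    # Hypothetical: `dvc repro` would export e.g. DVC_DATASET_DOGS=v1
    # for a stage that declares `-d ds://dogs@v1`.
    version = os.environ.get(f"DVC_DATASET_{name.upper()}")
    return {"name": name, "version": version}

ds_info = get_dataset("dogs")  # e.g. {"name": "dogs", "version": "v1"}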

I think we should do a single integration (or start with just one). Given the challenges with the former (data management via checkout/pull), maybe it's better to start with the latter (pipeline dependencies).

@skshetry
Member Author

skshetry commented Dec 8, 2023

Thinking about another alternative: maybe we should create a top-level `datasets` section in dvc.yaml where users register their datasets/versions, and then they can reference them in dvc stage add?

Eg:

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...

from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Passing data through envvars won't work if you want to let users do python script.py outside of dvc repro/dvc exp run/dvclive.

Also, I am not quite convinced we need a stage-level dependency here, since there are not going to be multiple versions of the same dataset, and it's already registered globally.

(We'll need to know whether it changed or not, in order to rerun stages.)
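To make that concrete, a sketch of the check (assuming dvc.lock would record a datasets list mirroring dvc.yaml; none of this exists yet):

import yaml

def _pinned(path: str) -> dict:
    # Map dataset name -> pinned url, e.g. {"dogs": "ds://dogs@v2"},
    # from a top-level `datasets` list like the one above.
    with open(path) as f:
        entries = yaml.safe_load(f).get("datasets", [])
    return {u.split("://")[1].split("@")[0]: u for u in entries}

def dataset_changed(name: str) -> bool:
    # A stage would rerun when the registered pin differs from the one
    # recorded at the last run (hypothetically kept in dvc.lock).
    return _pinned("dvc.yaml").get(name) != _pinned("dvc.lock").get(name)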

@dberenbaum
Collaborator

dberenbaum commented Dec 8, 2023

Just sent an invite for the 3 of us to discuss next week.

There is also an open question of how dvc pull/dvc checkout, etc. would work. We also need to support non-instantiated/non-materialized datasets, as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to support such a dataset as a stage dependency in pipelines.

So these are the possible approaches I see for accessing DVCX datasets. Does it look correct to you two?

[Diagram: DVC DVCX - Frame 2(1), comparing the possible approaches]

This corresponds to phase 1 and phase 2 in Notion.

@skshetry Are you saying we should focus on approach 2?

@skshetry
Member Author

skshetry commented Dec 8, 2023

In the second diagram, dvc will have to act as an intermediary.

The dataset info needs to be tracked in dvc's metadata, not in the source code, so that when users update the metadata (via Studio or through CLI), dvc will retrigger stages that changed.

At the same time, the dataset that is being referenced in source code should change to the newer dataset, and should be used for training.

Users could update it in both places - in the source code and in the metadata - but making that update as close to one-click as possible (in Studio and in the CLI), so that you can retrigger your pipelines and run your changes, seems to be the feature we want here.

@skshetry
Member Author

skshetry commented Dec 8, 2023

Are you saying we should focus on approach 2?

I am saying that we should start with one integration first.

Regarding the first approach, there is a strict requirement not to touch the dataset, and there are also technical challenges. Given this, I am not sure dvc even needs to be involved?

Can users use dql to materialize it? Can it be just a dependency, or an external dependency from dvc's point of view, making dql the command to manipulate/checkout/pull datasets?

@dberenbaum
Collaborator

Thanks @skshetry. I agree that starting with approach 2 makes sense, just wanted to clarify that's what you meant. I also agree DVC needs awareness of the dataset, but I'm not convinced we need top-level datasets.

Within approach 2, I see two different scenarios:

a. Fixed versions -- I want to specify the exact version every time
b. Latest version -- I always want to use the latest but need to snapshot it for reproducibility

2a looks simple to me. Use the DVCX API directly in your code and parametrize it with DVC:

# params.yaml
dataset:
  name: dogs
  version: 2
# dvc.yaml
stages:
  train:
    cmd: python train.py ${dataset}
ds = DatasetQuery(name=name, version=version)
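Spelling out the 2a code path: train.py could read those values with dvc.api.params_show() and hand them to DVCX directly (a sketch assuming the params.yaml layout above):

# train.py
import dvc.api
from dql.query import DatasetQuery

params = dvc.api.params_show()  # reads params.yaml, incl. the dataset section
ds = DatasetQuery(name=params["dataset"]["name"],
                  version=params["dataset"]["version"])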

2b is where we need DVC to snapshot the version, but only because the user doesn't want to specify it in code, so it can look like this:

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - ds://dogs
# dvc.lock
train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2

ds = DatasetQuery(name="dogs")

update the metadata (via Studio or through CLI)

In 2a, it can be done by updating the parameters.

In 2b, the update will happen automatically.

Thinking about another alternative: maybe we should create a top-level `datasets` section in dvc.yaml where users register their datasets/versions, and then they can reference them in dvc stage add?

I would prefer not to because:

  • It's a more complex UX -- need another abstraction to add the dataset to DVCX first, then need to reference that with special syntax in dependencies and in code, and can't use code outside dvc
  • It's less consistent with how we treat other datasets -- other dependencies don't require any dvc-specific syntax in code and don't need to be defined outside the pipeline first
  • It's more work to implement

@skshetry
Member Author

skshetry commented Dec 8, 2023

Maybe I am over-optimizing here and this is unnecessary, but dependencies can point to multiple versions of the same dataset. That's what I was trying to avoid with this registration.

train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2
    - path: ds://dogs
      version: v1

Secondly, they are stage-local. You could have one stage using one version of a dataset and a separate stage using a different version.

We could easily forbid this though (at least in the DatasetQuery/get_dataset API).
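A sketch of the kind of check that could enforce this when loading dvc.yaml (not current dvc behavior; dep layout as in the snippet above):

from collections import Counter

def check_unique_datasets(deps: list) -> None:
    # Reject a stage whose deps pin the same ds:// dataset more than once.
    paths = [d["path"] for d in deps
             if isinstance(d, dict) and str(d.get("path", "")).startswith("ds://")]
    for path, count in Counter(paths).items():
        if count > 1:
            raise ValueError(f"{path} is pinned {count} times in a single stage")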

Thirdly, you'd have to update it in multiple places rather than just one. But that's a minor issue.

@skshetry
Member Author

skshetry commented Dec 8, 2023

@dberenbaum, how would dvc and Studio understand which dataset is being used in 2a?

@skshetry
Member Author

skshetry commented Dec 8, 2023

Also, do you think 2b) is important? Why don't we push for pinning the dataset for stronger reproducibility?

Users can update dvc.yaml when they want to update (and it could be one-click from Studio if needed).

@dberenbaum
Collaborator

Okay, I think you've convinced me on having top-level datasets as long as we are all aligned that the focus is on approach 2. I have some thoughts on minor modifications, and I do think 2b is important as it's probably the default workflow for a lot of people, but let's take the weekend to think about it and we can follow up on details next week.

@dberenbaum
Collaborator

Adding some additional thoughts. Maybe you already thought of some of these but didn't detail them yet.

dvc.yaml

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...

Maybe we can generalize this more? We might want to reuse datasets in the future for other data sources (databases, cloud-versioned data, data from other repos). It seems like this is going in the direction of #2378, where datasets act like callback dependencies that don't rely on file hashing. WDYT about something like this (or we could keep the ds:// syntax but include dvcx somewhere)?

# dvc.yaml
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: v2
stages:
  train:
    deps:
      - dataset: dogs

Also wondering: how should it look in dvc.lock?

python api

from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Can dataset.get() (or some method) return info about the dataset instead of actually reading the dataset? For example:

from dql.query import DatasetQuery
from dvc.api import dataset

ds = DatasetQuery(**dataset.get("dogs"))

As DVCX adds more methods and ways to stream data like pytorch data loaders, etc., we don't want to have to add more DVC API methods to support every DVCX way to load data. Users can keep their native DVCX code and only use DVC to provide the dataset info.
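For illustration, such a dataset.get() might be little more than a lookup into the registered datasets in dvc.yaml (a sketch assuming the mapping layout proposed above; the real API shape is still open):

import yaml

def get(name: str) -> dict:
    # Look up the pinned dataset info for DatasetQuery(**...) to consume,
    # e.g. {"name": "dogs", "version": "v2"}.
    with open("dvc.yaml") as f:
        spec = yaml.safe_load(f)["datasets"][name]
    return {"name": spec["name"], "version": spec["version"]}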

@dmpetrov
Member

@dberenbaum looks great!

the direction of #2378, where datasets act like callback dependencies that don't rely on file hashing

Yes, it's related - you ask for the update status using an external tool (DVCX, DBT, or custom code). IDK how much generalization is needed at this point.

Also wondering: how should it look in dvc.lock?

I guess a UUID or timestamp of the dataset. This is needed so we don't mix up dogs@v2 with a dogs@v2 that was created and then deleted last week. Later DVCX will introduce a proper fingerprint, but that is a whole separate story.
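Concretely, a dvc.lock entry might then look something like this (illustrative only; the exact fields were not settled):

train:
  deps:
    - path: ds://dogs
      version: v2
      uuid: 6a9f0c52-...   # pins the exact dataset instance, not just the label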

@dberenbaum
Collaborator

I guess a UUID or timestamp of the dataset. This is needed so we don't mix up dogs@v2 with a dogs@v2 that was created and then deleted last week. Later DVCX will introduce a proper fingerprint, but that is a whole separate story.

Related to #982. Is it important when creating and deleting a dataset that you can reuse the same version numbers? If we have a UUID or timestamp, can DVC still get the deleted dataset to reproduce (assuming the raw storage hasn't changed)? What is the point of version numbers if we also plan to have UUID/timestamp?

@dmpetrov
Member

dmpetrov commented Dec 12, 2023

Name and version are human-readable.

It would be the same as when dvc imports a file from a git repository: the file name and tags are nice, but it's better to use a sha for a proper snapshot.

@dberenbaum dberenbaum moved this from In Progress to Review In Progress in DVC Feb 6, 2024
@dberenbaum dberenbaum moved this from Review In Progress to In Progress in DVC Feb 6, 2024
@skshetry skshetry linked a pull request Feb 23, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Feb 26, 2024