dvc.api for dataset #10316

Merged
merged 6 commits into from
Feb 26, 2024
Conversation

skshetry (Member) commented Feb 23, 2024

On top of #10287. Closes #10313.

Usage

```python
from dvc.api import get_dataset

get_dataset("<dataset_name>")
```

The `get_dataset()` API returns a dict whose keys depend on the type of the dataset.

Examples:

dvcx

```python
from dvc.api import get_dataset
from dvcx.query import DatasetQuery

d = get_dataset("dogs")
print(d)
# {
#     'type': 'dvcx',
#     'name': 'dogs',
#     'version': 3,
# }

assert d["type"] == "dvcx"
query = DatasetQuery(name=d["name"], version=d["version"])
```

dvc

```python
from dvc.api import DVCFileSystem, get_dataset

d = get_dataset("example-get-started")
print(d)
# {
#     'type': 'dvc',
#     'url': 'git@github.com:iterative/example-get-started.git',
#     'path': 'path',
#     'sha': 'df75c16ef61f0772d6e4bb27ba4617b06b4b5398'
# }

assert d["type"] == "dvc"
fs = DVCFileSystem(d["url"], rev=d["sha"])
fs.read_text(d["path"])
```

cloud-versioned and non-cloud-versioned remotes

```python
from dvc.api import get_dataset
from fsspec import filesystem

d = get_dataset("cloud-demo")
print(d)
# {
#     'type': 'url',
#     'files': [
#         's3://cloud-versioning-demo/myproject/model.pt?versionId=5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr',
#         's3://cloud-versioning-demo/remote/predictions/1.txt?versionId=M2CEHoqXy3JZq07EmiLRMQCuj0wU8mbp'
#     ],
#     'path': 's3://cloud-versioning-demo'
# }

assert d["type"] == "url"
fs = filesystem("s3")
fs.cat_file(d["path"])      # for a single file
fs.cat_file(d["files"][0])  # for a directory, fetch entries from d["files"]
```
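Since a `url`-type result may describe either a single object or a directory, a small helper (hypothetical, not part of `dvc.api`) can normalize the two cases before fetching:

```python
def dataset_file_urls(d: dict) -> list:
    """Return every file URL in a 'url'-type get_dataset() result.

    For a directory, the 'files' list is populated; for a single
    object, fall back to the 'path' itself.
    """
    assert d["type"] == "url"
    return list(d["files"]) if d.get("files") else [d["path"]]


# Directory case: 'files' lists each versioned object
print(dataset_file_urls({
    "type": "url",
    "files": ["s3://bucket/a.txt?versionId=1", "s3://bucket/b.txt?versionId=2"],
    "path": "s3://bucket",
}))
# ['s3://bucket/a.txt?versionId=1', 's3://bucket/b.txt?versionId=2']
```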

Note

This will also expand `remote://` URLs and return a canonicalized path.


The return type is a union of the following TypedDicts:

```python
from typing import Literal, TypedDict


class DVCXDataset(TypedDict):
    type: Literal["dvcx"]
    name: str
    version: int


class DVCDataset(TypedDict):
    type: Literal["dvc"]
    url: str
    path: str
    sha: str


class URLDataset(TypedDict):
    type: Literal["url"]
    files: list[str]
    path: str
```
Users can narrow the type with `assert d["type"] == "dvcx"` (which is the sole reason the `type` key exists).
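Since every variant carries the discriminating `type` key, callers can dispatch on it. A minimal sketch (the `describe` helper and the sample dict are hypothetical, not part of the API):

```python
def describe(d: dict) -> str:
    """Build a human-readable summary by narrowing on the 'type' key."""
    if d["type"] == "dvcx":
        return f"dvcx dataset {d['name']!r} at version {d['version']}"
    if d["type"] == "dvc":
        return f"dvc dataset {d['path']!r} at {d['url']} (rev {d['sha'][:7]})"
    if d["type"] == "url":
        return f"cloud-versioned dataset with {len(d['files'])} file(s)"
    raise ValueError(f"unknown dataset type: {d['type']!r}")


# Sample dict mirroring the dvcx example above
print(describe({"type": "dvcx", "name": "dogs", "version": 3}))
# dvcx dataset 'dogs' at version 3
```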

This adds support for virtually tracking datasets, such as dvcx datasets, datasets from remote dvc/git registries, and cloud-versioned remotes.

This PR introduces two commands under the `ds` namespace: `add` and `update`.

The `dvc ds add` command adds the dataset to `dvc.yaml` and tracks the sources in the `dvc.lock` file. Similarly, `dvc ds update` updates the sources in the `dvc.lock` file.

Example Usage:
```console
dvc ds add --name dogs --url dvcx://dogs # tracking dvcx dataset
# tracking dvc/git registries
dvc ds add --name example --url dvc://git@github.com/iterative/example.git
# cloud-versioned-remotes
dvc ds add --name versioning --url s3://cloud-versioning-demo
```

To update, specify the name.
```console
dvc ds update <name>
```

`dvc ds add` freezes the dataset (in the same sense as `dvc import`/`dvc import-url`). It keeps a "specification" of the dataset in `dvc.yaml` and also freezes information about the dataset in the `dvc.lock` file. Both are kept inside the `datasets` section of the `dvc.yaml` and `dvc.lock` files.

This metadata is used in pipelines. You can add a dependency to your stage using the `ds://` scheme followed by the name of the dataset in the `dvc.yaml` file. Because it is used in pipelines, the name of a dataset has to be unique in the repository; different datasets with the same name are not allowed. On `dvc repro`, DVC copies the frozen information about the particular dataset into the `deps` field for the stage in `dvc.lock`. On successive invocations, DVC compares the information from the `deps` field of the lock file with the frozen information in the `datasets` section and decides whether to rerun.
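The uniqueness requirement can be illustrated with a few lines of Python; this duplicate check is a hypothetical sketch, not DVC's actual validation:

```python
from collections import Counter


def duplicate_dataset_names(datasets: list) -> list:
    """Return names that appear more than once in a dvc.yaml 'datasets' list."""
    counts = Counter(d["name"] for d in datasets)
    return [name for name, n in counts.items() if n > 1]


datasets = [
    {"name": "dogs", "url": "dvcx://dogs", "type": "dvcx"},
    {"name": "dogs", "url": "dvcx://dogs@v2", "type": "dvcx"},
]
print(duplicate_dataset_names(datasets))  # ['dogs']
```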

As the dataset is frozen, `dvc repro` won't rerun until the dataset is updated via `dvc ds update`.

Here is an example of what `dvc.yaml` looks like:
```yaml
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
- name: example-get-started
  url: git@github.com:iterative/example-get-started.git
  type: dvc
  path: path
- name: dogs2
  url: dvcx://dogs@v2
  type: dvcx
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
```

And the corresponding `dvc.lock`:

```yaml
schema: '2.0'
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
  version: 3
  created_at: '2023-12-11T10:32:05.942708+00:00'
- name: example-get-started
  url: git@github.com:iterative/example-get-started.git
  type: dvc
  path: path
  rev_lock: df75c16ef61f0772d6e4bb27ba4617b06b4b5398
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
  meta:
    isdir: true
    size: 323919
    nfiles: 33
  files:
  - relpath: myproject/model.pt
    meta:
      size: 106433
      version_id: 5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr
      etag: 3bc0028677ce6fb65bec8090c248b002
# truncated
```

Pipeline stages keep this information in the `dataset` section of individual deps.

```console
dvc stage add -n train -d ds://dogs python train.py
cat dvc.yaml
```

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

When it is reproduced, `dvc.lock` will look something like the following:
```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        url: dvcx://dogs
        type: dvcx
        version: 3
        created_at: '2023-12-11T10:32:05.942708+00:00'
```
codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 36.36364%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 90.00%. Comparing base (f2db666) to head (7ee1e6a).

| Files | Patch % | Lines |
| --- | --- | --- |
| dvc/api/dataset.py | 34.88% | 28 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           datasets   #10316      +/-   ##
============================================
- Coverage     90.07%   90.00%   -0.08%     
============================================
  Files           495      496       +1     
  Lines         38161    38205      +44     
  Branches       5532     5541       +9     
============================================
+ Hits          34375    34386      +11     
- Misses         3153     3184      +31     
- Partials        633      635       +2     


type: Literal["dvc"]
url: str
path: str
sha: str
Collaborator:

Minor, but can we name it rev to match the rest of the API?

Member Author (skshetry):

This is not a rev, but actually a sha/commit-hash.

@dberenbaum (Collaborator):

> from dvc.api.dataset import get

Do you anticipate additional methods? Even if you want a separate module, should we just expose it as a top-level API method like get_ds()?

@skshetry (Member Author):

> from dvc.api.dataset import get
>
> Do you anticipate additional methods? Even if you want a separate module, should we just expose it as a top-level API method like get_ds()?

I can rename it to get_dataset under top-level.

@skshetry skshetry changed the base branch from datasets to main February 26, 2024 15:43
@skshetry skshetry merged commit f38490c into iterative:main Feb 26, 2024
19 checks passed
@skshetry skshetry deleted the api-datasets branch February 26, 2024 15:44
BradyJ27 pushed a commit to BradyJ27/dvc that referenced this pull request Apr 22, 2024
* add support for tracking remote dataset

* remodeling

* remove non-cloud-versioned remotes

* improve deserializing; handle invalidation

* dvc.api for dataset

* expose the API in the top-level
Successfully merging this pull request may close these issues.

dvc.api.dataset