dvc.api for dataset #10316

Merged
merged 6 commits into from
Feb 26, 2024
Conversation

skshetry (Member) commented Feb 23, 2024

On top of #10287. Closes #10313.

Usage

```python
from dvc.api import get_dataset

get_dataset("<dataset_name>")
```

The `get_dataset()` API returns a dict whose keys depend on the type of the dataset.

Examples:

dvcx

```python
from dvc.api import get_dataset
from dvcx.query import DatasetQuery

d = get_dataset("dogs")
print(d)
# {
#     'type': 'dvcx',
#     'name': 'dogs',
#     'version': 3,
# }

assert d["type"] == "dvcx"
query = DatasetQuery(name=d["name"], version=d["version"])
```

dvc

```python
from dvc.api import DVCFileSystem, get_dataset

d = get_dataset("example-get-started")
print(d)
# {
#     'type': 'dvc',
#     'url': 'git@github.com:iterative/example-get-started.git',
#     'path': 'path',
#     'sha': 'df75c16ef61f0772d6e4bb27ba4617b06b4b5398'
# }

assert d["type"] == "dvc"
fs = DVCFileSystem(d["url"], rev=d["sha"])
fs.read_text(d["path"])
```

cloud-versioned and non-cloud-versioned remotes

```python
from dvc.api import get_dataset
from fsspec import filesystem

d = get_dataset("cloud-demo")
print(d)
# {
#     'type': 'url',
#     'files': [
#         's3://cloud-versioning-demo/myproject/model.pt?versionId=5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr',
#         's3://cloud-versioning-demo/remote/predictions/1.txt?versionId=M2CEHoqXy3JZq07EmiLRMQCuj0wU8mbp'
#     ],
#     'path': 's3://cloud-versioning-demo'
# }

assert d["type"] == "url"
fs = filesystem("s3")
fs.cat_file(d["path"])      # for a single file
fs.cat_file(d["files"][0])  # for a directory, fetch entries from d["files"]
```
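Since a `url`-type result may describe either a single object or a directory, a small helper (hypothetical, not part of `dvc.api`) can normalize the two cases before fetching:

```python
def dataset_file_urls(d: dict) -> list:
    """Return every file URL in a 'url'-type get_dataset() result.

    For a directory, the 'files' list is populated; for a single
    object, fall back to the 'path' itself.
    """
    assert d["type"] == "url"
    return list(d["files"]) if d.get("files") else [d["path"]]


# Directory case: 'files' lists each versioned object
print(dataset_file_urls({
    "type": "url",
    "files": ["s3://bucket/a.txt?versionId=1", "s3://bucket/b.txt?versionId=2"],
    "path": "s3://bucket",
}))
# ['s3://bucket/a.txt?versionId=1', 's3://bucket/b.txt?versionId=2']
```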

Note

This will also expand `remote://` URLs and return a canonicalized path.


The return type is a union of the following TypedDicts:

```python
from typing import Literal, TypedDict


class DVCXDataset(TypedDict):
    type: Literal["dvcx"]
    name: str
    version: int


class DVCDataset(TypedDict):
    type: Literal["dvc"]
    url: str
    path: str
    sha: str


class URLDataset(TypedDict):
    type: Literal["url"]
    files: list[str]
    path: str
```
Users can narrow the type with `assert d["type"] == "dvcx"` (which is the sole reason the `type` key exists).
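Since every variant carries the discriminating `type` key, callers can dispatch on it. A minimal sketch (the `describe` helper and the sample dict are hypothetical, not part of the API):

```python
def describe(d: dict) -> str:
    """Build a human-readable summary by narrowing on the 'type' key."""
    if d["type"] == "dvcx":
        return f"dvcx dataset {d['name']!r} at version {d['version']}"
    if d["type"] == "dvc":
        return f"dvc dataset {d['path']!r} at {d['url']} (rev {d['sha'][:7]})"
    if d["type"] == "url":
        return f"cloud-versioned dataset with {len(d['files'])} file(s)"
    raise ValueError(f"unknown dataset type: {d['type']!r}")


# Sample dict mirroring the dvcx example above
print(describe({"type": "dvcx", "name": "dogs", "version": 3}))
# dvcx dataset 'dogs' at version 3
```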

This adds support for virtually tracking datasets, such as dvcx datasets, datasets from remote dvc/git registries, and cloud-versioned remotes.

This PR introduces two commands under the `ds` namespace: `add` and `update`.

The `dvc ds add` command adds the dataset to `dvc.yaml` and tracks the sources in the `dvc.lock` file. Similarly, `dvc ds update` updates the sources in the `dvc.lock` file.

Example Usage:
```console
dvc ds add --name dogs --url dvcx://dogs # tracking dvcx dataset
# tracking dvc/git registries
dvc ds add --name example --url dvc://git@github.com/iterative/example.git
# cloud-versioned-remotes
dvc ds add --name versioning --url s3://cloud-versioning-demo
```

To update, specify the name.
```console
dvc ds update <name>
```

`dvc ds add` freezes the dataset (in the same sense as `dvc import`/`dvc import-url`). It keeps a "specification" of the dataset in `dvc.yaml` and also freezes information about the dataset in the `dvc.lock` file. Both are kept inside the `datasets` section of the `dvc.yaml` and `dvc.lock` files.

This metadata is used in pipelines. You can add a dependency to your stage using the `ds://` scheme followed by the name of the dataset in the `dvc.yaml` file. Because it is used in pipelines, the name of a dataset has to be unique in the repository; different datasets with the same name are not allowed. On `dvc repro`, DVC copies the frozen information about the particular dataset into the `deps` field for the stage in `dvc.lock`. On successive invocations, DVC compares the information from the `deps` field of the lock file with the frozen information in the `datasets` section and decides whether to rerun.
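The uniqueness requirement can be illustrated with a few lines of Python; this duplicate check is a hypothetical sketch, not DVC's actual validation:

```python
from collections import Counter


def duplicate_dataset_names(datasets: list) -> list:
    """Return names that appear more than once in a dvc.yaml 'datasets' list."""
    counts = Counter(d["name"] for d in datasets)
    return [name for name, n in counts.items() if n > 1]


datasets = [
    {"name": "dogs", "url": "dvcx://dogs", "type": "dvcx"},
    {"name": "dogs", "url": "dvcx://dogs@v2", "type": "dvcx"},
]
print(duplicate_dataset_names(datasets))  # ['dogs']
```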

As the dataset is frozen, `dvc repro` won't rerun until the dataset is updated via `dvc ds update`.

Here is an example of what `dvc.yaml` looks like:
```yaml
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
- name: example-get-started
  url: git@github.com:iterative/example-get-started.git
  type: dvc
  path: path
- name: dogs2
  url: dvcx://dogs@v2
  type: dvcx
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
```

And the corresponding `dvc.lock`:

```yaml
schema: '2.0'
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
  version: 3
  created_at: '2023-12-11T10:32:05.942708+00:00'
- name: example-get-started
  url: git@github.com:iterative/example-get-started.git
  type: dvc
  path: path
  rev_lock: df75c16ef61f0772d6e4bb27ba4617b06b4b5398
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
  meta:
    isdir: true
    size: 323919
    nfiles: 33
  files:
  - relpath: myproject/model.pt
    meta:
      size: 106433
      version_id: 5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr
      etag: 3bc0028677ce6fb65bec8090c248b002
# truncated
```

Pipeline stages keep this information in the `dataset` section of individual deps.

```console
dvc stage add -n train -d ds://dogs python train.py
cat dvc.yaml
```

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

When it is reproduced, `dvc.lock` will look something like the following:
```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        url: dvcx://dogs
        type: dvcx
        version: 3
        created_at: '2023-12-11T10:32:05.942708+00:00'
```
codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 36.36364%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 90.00%. Comparing base (f2db666) to head (7ee1e6a).

| Files | Patch % | Lines |
| --- | --- | --- |
| dvc/api/dataset.py | 34.88% | 28 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           datasets   #10316      +/-   ##
============================================
- Coverage     90.07%   90.00%   -0.08%     
============================================
  Files           495      496       +1     
  Lines         38161    38205      +44     
  Branches       5532     5541       +9     
============================================
+ Hits          34375    34386      +11     
- Misses         3153     3184      +31     
- Partials        633      635       +2     


type: Literal["dvc"]
url: str
path: str
sha: str
Collaborator:

Minor, but can we name it rev to match the rest of the API?

Member Author (skshetry):

This is not a rev, but actually a sha/commit-hash.

@dberenbaum (Collaborator):

> from dvc.api.dataset import get

Do you anticipate additional methods? Even if you want a separate module, should we just expose it as a top-level API method like get_ds()?

@skshetry (Member Author):

> from dvc.api.dataset import get
>
> Do you anticipate additional methods? Even if you want a separate module, should we just expose it as a top-level API method like get_ds()?

I can rename it to get_dataset under top-level.

@skshetry skshetry changed the base branch from datasets to main February 26, 2024 15:43
@skshetry skshetry merged commit f38490c into iterative:main Feb 26, 2024
19 checks passed
@skshetry skshetry deleted the api-datasets branch February 26, 2024 15:44
BradyJ27 pushed a commit to BradyJ27/dvc that referenced this pull request Apr 22, 2024
* add support for tracking remote dataset

* remodeling

* remove non-cloud-versioned remotes

* improve deserializing; handle invalidation

* dvc.api for dataset

* expose the API in the top-level
Successfully merging this pull request may close these issues.

dvc.api.dataset