dvc.api for dataset #10316
This adds support for virtually tracking a dataset, such as a dvcx dataset, a dataset from remote dvc/git registries, or a cloud-versioned remote.

This PR introduces two commands under the `ds` namespace: `add` and `update`.

The `dvc ds add` command adds the dataset to `dvc.yaml` and tracks the sources in the `dvc.lock` file. Similarly, `dvc ds update` updates the sources in the `dvc.lock` file.

Example Usage:

```console
dvc ds add --name dogs --url dvcx://dogs # tracking dvcx dataset
# tracking dvc/git registries
dvc ds add --name example --url dvc://[email protected]/iterative/example.git
# cloud-versioned-remotes
dvc ds add --name versioning --url s3://cloud-versioning-demo
```

To update, specify the name.

```console
dvc ds update <name>
```

`dvc ds add` freezes the dataset (in the traditional sense of `dvc import/import-url`). It keeps a "specification" of the dataset in `dvc.yaml` and also freezes information about the dataset in the `dvc.lock` file. Both are kept inside the `datasets` section of the `dvc.yaml` and `dvc.lock` files.

This metadata is used in the pipelines. You can add a dependency to your stage using the `ds://` scheme, followed by the name of the dataset in the `dvc.yaml` file. Because it is used in pipelines, the `name` of the dataset has to be unique in the repository. Different datasets with the same name are not allowed.

On `dvc repro`, `dvc` copies the frozen information about the particular dataset into the `deps` field for the stage in `dvc.lock`. On successive invocations, `dvc` compares the information from the `deps` field of the lock with the frozen information in the `datasets` section and decides whether to rerun or not.

As the dataset is frozen, `dvc repro` won't rerun until the dataset is updated via `dvc ds update`.
Here are some examples of what `dvc.yaml` looks like:

```yaml
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
- name: example-get-started
  url: [email protected]:iterative/example-get-started.git
  type: dvc
  path: path
- name: dogs2
  url: dvcx://dogs@v2
  type: dvcx
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
```

```yaml
schema: '2.0'
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
  version: 3
  created_at: '2023-12-11T10:32:05.942708+00:00'
- name: example-get-started
  url: [email protected]:iterative/example-get-started.git
  type: dvc
  path: path
  rev_lock: df75c16ef61f0772d6e4bb27ba4617b06b4b5398
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
  meta:
    isdir: true
    size: 323919
    nfiles: 33
  files:
  - relpath: myproject/model.pt
    meta:
      size: 106433
      version_id: 5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr
      etag: 3bc0028677ce6fb65bec8090c248b002
# truncated
```

The pipeline stages keep them in the `dataset` section of individual deps.

```console
dvc stage add -n train -d ds://dogs python train.py
cat dvc.yaml
```

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

When it is reproduced, the `dvc.lock` will look something like the following:

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        url: dvcx://dogs
        type: dvcx
        version: 3
        created_at: '2023-12-11T10:32:05.942708+00:00'
```
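The rerun decision described above can be illustrated with a minimal sketch over plain dicts shaped like the `dvc.lock` examples. This is not DVC's implementation: the helper name `dataset_dep_changed` and the "never reproduced means rerun" fallback are assumptions made for illustration only.

```python
from typing import Any


def dataset_dep_changed(lock: dict[str, Any], stage: str, name: str) -> bool:
    """Hypothetical helper: True if `stage` should rerun for dataset `name`.

    Compares the copy stored under the stage's deps with the frozen entry
    in the top-level `datasets` section of the lock data.
    """
    frozen = next(
        (d for d in lock.get("datasets", []) if d["name"] == name), None
    )
    if frozen is None:
        raise KeyError(f"dataset {name!r} is not in the lock data")

    deps = lock.get("stages", {}).get(stage, {}).get("deps", [])
    for dep in deps:
        if dep.get("path") == f"ds://{name}":
            # Any difference between the two copies means the dataset
            # was updated after the stage last ran.
            return dep.get("dataset") != frozen
    # Assumption: no recorded copy means the stage has not run yet.
    return True


frozen_dogs = {
    "name": "dogs",
    "url": "dvcx://dogs",
    "type": "dvcx",
    "version": 3,
    "created_at": "2023-12-11T10:32:05.942708+00:00",
}
lock = {
    "schema": "2.0",
    "datasets": [dict(frozen_dogs)],
    "stages": {
        "train": {
            "cmd": "python train.py",
            "deps": [{"path": "ds://dogs", "dataset": dict(frozen_dogs)}],
        }
    },
}

print(dataset_dep_changed(lock, "train", "dogs"))
```

In this sketch, bumping `version` in the `datasets` entry (what an update would do) makes the two copies differ, so the stage would be considered changed.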
Codecov Report

Coverage Diff

| | datasets | #10316 | +/- |
|---|---|---|---|
| Coverage | 90.07% | 90.00% | -0.08% |
| Files | 495 | 496 | +1 |
| Lines | 38161 | 38205 | +44 |
| Branches | 5532 | 5541 | +9 |
| Hits | 34375 | 34386 | +11 |
| Misses | 3153 | 3184 | +31 |
| Partials | 633 | 635 | +2 |
type: Literal["dvc"]
url: str
path: str
sha: str
Minor, but can we name it `rev` to match the rest of the API?
This is not a `rev`, but actually a `sha`/commit hash.
Do you anticipate additional methods? Even if you want a separate module, should we just expose it as a top-level API method like
I can rename it to
* add support for tracking remote dataset
* remodeling
* remove non-cloud-versioned remotes
* improve deserializing; handle invalidation
* dvc.api for dataset
* expose the API in the top-level
On top of #10287. Closes #10313.
Usage

The `get_dataset()` API returns a dict whose values depend on the type of dataset.

Examples:

- dvcx
- dvc
- cloud-versioned and non-cloud-versioned remotes

Note: This will also expand `remote://` URLs and return a canonicalized path.

The return type is a union of `TypedDict`s, one per dataset type.

Users can narrow the type with `assert d["type"] == "dvcx"` (the sole reason why the `type` key exists).
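As an illustration of narrowing on the `type` key, here is a self-contained sketch of a tagged union of `TypedDict`s. Only the `dvc` variant's fields (`type`, `url`, `path`, `sha`) come from the snippet reviewed in this PR; the class names, the fields of the `dvcx` and `url` variants, and the `describe()` helper are hypothetical and may not match what `get_dataset()` actually returns.

```python
from typing import Literal, TypedDict, Union


class DVCXDataset(TypedDict):
    type: Literal["dvcx"]
    name: str     # assumed field, for illustration
    version: int  # assumed field, for illustration


class DVCDataset(TypedDict):
    # These four fields mirror the snippet under review in this PR.
    type: Literal["dvc"]
    url: str
    path: str
    sha: str


class URLDataset(TypedDict):
    type: Literal["url"]
    url: str  # assumed field, for illustration


Dataset = Union[DVCXDataset, DVCDataset, URLDataset]


def describe(d: Dataset) -> str:
    """Hypothetical helper: format a dataset dict based on its `type` tag."""
    if d["type"] == "dvcx":
        # A type checker narrows `d` to DVCXDataset in this branch.
        return f"dvcx://{d['name']}@v{d['version']}"
    if d["type"] == "dvc":
        # Narrowed to DVCDataset: `path` and `sha` are known to exist.
        return f"{d['url']}:{d['path']}@{d['sha'][:7]}"
    # Only URLDataset remains.
    return d["url"]


print(describe({"type": "dvcx", "name": "dogs", "version": 3}))
```

Because each variant declares a distinct `Literal` for `type`, a comparison or `assert` on that key is enough for a static type checker to know which other keys are available.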