add support for tracking remote dataset #10287
Conversation
UX looks good! There have been discussions before about consolidating the import syntax (cc @efiop) and I'm wondering if we should take the opportunity to do it here. For example: …
Notes from discussion with @skshetry:
force-pushed from 4ffe4c7 to 84a5cf5
Codecov Report. Attention: Patch coverage is …
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main   #10287      +/-   ##
==========================================
- Coverage   90.59%   90.07%   -0.52%
==========================================
  Files         493      495       +2
  Lines       37774    38161     +387
  Branches     5460     5532      +72
==========================================
+ Hits        34221    34375     +154
- Misses       2919     3153     +234
+ Partials      634      633       -1
```

☔ View full report in Codecov by Sentry.
@skshetry What's the status of this?
This should be ready for review. The above asciinema is stale (although the only change is the removal of, eg, `dvc stage add -ntest -d ds://dataset_name "command"`). To import data from dvc/git repositories, you'll need to add the `dvc://` schema to the repo url, e.g.: …
I'll try to update the asciinema recording tomorrow, and add more descriptions in the PR.
Updated the asciinema recording above.
force-pushed from 3a459f9 to 9fdb8ae
This adds support for virtually tracking a dataset such as a dvcx dataset, a dataset from remote dvc/git registries, and cloud-versioned remotes.

This PR introduces two different commands under the `ds` namespace: `add` and `update`. The `dvc ds add` command adds the dataset to `dvc.yaml` and tracks the sources in the `dvc.lock` file. Similarly, `dvc ds update` updates the sources in the `dvc.lock` file.

Example Usage:

```console
dvc ds add --name dogs --url dvcx://dogs  # tracking dvcx dataset

# tracking dvc/git registries
dvc ds add --name example --url dvc://[email protected]/iterative/example.git

# cloud-versioned-remotes
dvc ds add --name versioning --url s3://cloud-versioning-demo
```

To update, specify the name.

```console
dvc ds update <name>
```

`dvc ds add` freezes (in the traditional sense of `dvc import/import-url`). It keeps a "specification" of the dataset in `dvc.yaml` and also freezes information about the dataset in the `dvc.lock` file. They are kept inside the `datasets` section in both `dvc.yaml` and `dvc.lock` files.

This metadata is used in the pipelines. You can add a dependency to your stage using the `ds://` scheme, followed by the name of the dataset in the `dvc.yaml` file. As it is used in pipelines, the `name` of the dataset has to be unique in the repository; different datasets with the same name are not allowed. On `dvc repro`, `dvc` copies the frozen information about the particular dataset into the `deps` field for the stage in `dvc.lock`. On successive invocations, `dvc` will compare the information from the deps field of the lock with the frozen information in the `datasets` section and decide whether to rerun or not.

As the dataset is frozen, `dvc repro` won't rerun until the dataset is updated via `dvc ds update`.

Here are some examples of what `dvc.yaml` and `dvc.lock` look like:

```yaml
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
- name: example-get-started
  url: [email protected]:iterative/example-get-started.git
  type: dvc
  path: path
- name: dogs2
  url: dvcx://dogs@v2
  type: dvcx
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
```

```yaml
schema: '2.0'
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
  version: 3
  created_at: '2023-12-11T10:32:05.942708+00:00'
- name: example-get-started
  url: [email protected]:iterative/example-get-started.git
  type: dvc
  path: path
  rev_lock: df75c16ef61f0772d6e4bb27ba4617b06b4b5398
- name: cloud-versioning-demo
  url: s3://cloud-versioning-demo
  type: url
  meta:
    isdir: true
    size: 323919
    nfiles: 33
  files:
  - relpath: myproject/model.pt
    meta:
      size: 106433
      version_id: 5qrtnhnQ4fBzV73kqqK6pMGhTOzd_IPr
      etag: 3bc0028677ce6fb65bec8090c248b002
# truncated
```

The pipeline stages keep them in the `dataset` section of individual deps.

```console
dvc stage add -n train -d ds://dogs python train.py
cat dvc.yaml
```

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

When it is reproduced, the `dvc.lock` will look something like the following:

```yaml
# truncated
stages:
  train:
    cmd: python train.py
    deps:
    - path: ds://dogs
      dataset:
        name: dogs
        url: dvcx://dogs
        type: dvcx
        version: 3
        created_at: '2023-12-11T10:32:05.942708+00:00'
```
```python
    return self.msg


class Datasets(Mapping[str, Dataset]):
```
`Datasets` is a registry API that will be useful for Studio to get information about datasets, and update the dataset if needed (there are some plans to offer "Run an experiment" with updated versions of the dataset). It provides a dictionary-like API.
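For illustration, a minimal sketch of how that dictionary-like API might be used (assuming the registry is exposed as `repo.datasets`; the dataset name is made up):

```python
from dvc.repo import Repo

repo = Repo(".")
datasets = repo.datasets  # Mapping[str, Dataset]

# dictionary-style lookups and iteration
dogs = datasets["dogs"]   # raises a dataset-not-found KeyError if missing
for name in datasets:
    print(name, datasets[name].url)
```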
```python
lockfile = Lockfile(self.repo, Path(manifest_path).with_suffix(".lock"))
lockfile.dump_dataset(lock_data)
```

```python
def dump(self, dataset: Dataset, old: Optional[Dataset] = None) -> None:
```
This provides a diff update to avoid accidentally modifying `dvc.yaml` or `dvc.lock` when not needed.
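A minimal sketch of that diff-update idea, assuming hypothetical `spec`/`lock` attributes and `_dump_spec`/`_dump_lock` helpers (not the PR's exact code):

```python
from typing import Optional


def dump(self, dataset: "Dataset", old: Optional["Dataset"] = None) -> None:
    # Write only the parts that actually changed, so an unchanged
    # dataset never rewrites dvc.yaml or dvc.lock.
    if old is None or old.spec != dataset.spec:
        self._dump_spec(dataset)  # datasets section of dvc.yaml
    if old is None or old.lock != dataset.lock:
        self._dump_lock(dataset)  # datasets section of dvc.lock
```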
```diff
@@ -47,6 +47,7 @@
 LOCKFILE_STAGES_SCHEMA = {str: LOCK_FILE_STAGE_SCHEMA}
 LOCKFILE_SCHEMA = {
     vol.Required("schema"): vol.Equal("2.0", "invalid schema version"),
+    "datasets": object,
```
I am not making this schema stricter for now. We'll define this later. I need to figure out how to define this without duplicating `Dataset`/`DatasetLock`.
Same with the `dvc.yaml` schema below.
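For illustration only, a stricter `datasets` schema might eventually look something like this (a voluptuous sketch; the field names come from the dvc.yaml examples in this PR, everything else is assumed):

```python
import voluptuous as vol

DATASET_SCHEMA = vol.Schema(
    {
        vol.Required("name"): str,
        vol.Required("url"): str,
        vol.Required("type"): vol.In(["dvc", "dvcx", "url"]),
        "path": str,  # dvc-type datasets only
        "rev": str,   # dvc-type datasets only
    }
)

LOCKFILE_SCHEMA = {
    vol.Required("schema"): vol.Equal("2.0", "invalid schema version"),
    "datasets": [DATASET_SCHEMA],  # instead of the permissive `object`
}
```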
dvc/repo/datasets.py
```python
@frozen(kw_only=True, order=True)
class FileInfo(SerDe):
```
I have split up the `files` metadata into `relpath` and `meta`. In `dvc.lock`, they are a single dictionary. I did this to make it simpler to serialize/deserialize, and I find it conceptually better, as the `meta` is for a `path`. We do have this inconsistency now with cloud-versioned outputs, though.
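A rough sketch of that split (an assumed shape that mirrors the `files` entries in the dvc.lock example above, not DVC's exact code):

```python
from attrs import asdict, field, frozen


@frozen(kw_only=True)
class FileInfo:
    relpath: str
    meta: dict = field(factory=dict)  # size, version_id, etag, ...

    def to_dict(self) -> dict:
        # dvc.lock stores both parts as one dictionary per file
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "FileInfo":
        return cls(relpath=d["relpath"], meta=d.get("meta", {}))
```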
dvc/repo/datasets.py
```python
@frozen(kw_only=True)
class DVCDataset(SerDe):
```
All the `*Dataset` classes have a corresponding `*DatasetLock` class. `*Dataset` defines the specification for the dataset. `*DatasetLock` captures source information, but you'll see here that it also inherits from the `*Dataset` classes. I am duplicating this information because I want it to be possible to generate a `dvc.yaml` spec out of the lock. This might be useful in the future; I have definitely wanted something similar with the stages (eg: the run cache being useful to regenerate `dvc.yaml`, etc.).

`DatasetLock`/`Dataset` are implemented from a serialization/deserialization perspective, and I haven't put much thought into anything else. We'll likely need to refactor them in the future. I do regret not using `pydantic` here, but `attrs` works for now.
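A simplified sketch of that relationship (field names are borrowed from the examples in this PR; the real classes carry more machinery):

```python
from typing import Optional

from attrs import frozen


@frozen(kw_only=True)
class DVCDataset:
    # the specification, i.e. what is written to dvc.yaml
    url: str
    path: str
    rev: Optional[str] = None


@frozen(kw_only=True)
class DVCDatasetLock(DVCDataset):
    # spec fields plus resolved source info, i.e. what dvc.lock captures;
    # inheriting the spec means dvc.yaml can be regenerated from the lock
    rev_lock: str
```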
```python
def __str__(self) -> str:
    return self.msg
```
Have to override `__str__` as `KeyError` quotes the message.
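A quick illustration of the quoting behavior being worked around (the exception name here is illustrative):

```python
class DatasetNotFoundError(KeyError):  # illustrative name
    def __init__(self, msg: str):
        self.msg = msg
        super().__init__(msg)

    def __str__(self) -> str:
        return self.msg


print(KeyError("no dataset: dogs"))             # 'no dataset: dogs'  <- quoted
print(DatasetNotFoundError("no dataset: dogs")) # no dataset: dogs
```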
```python
name, version = self.name_version
catalog = get_catalog()
record = catalog.get_remote_dataset(name)
```
`dvcx` datasets only work with SaaS at the moment, but we can support them in dvc easily. Supporting them in Studio is a different matter, though, so I have left that out.

There is no unique identifier for a dataset. There is a `created_at` timestamp, but this is only reliable for a remote dataset; a local dataset will have a `created_at` timestamp from when it was `pull`-ed to the machine.

There is no provenance information for the dataset, either. The dataset could have come from a local dvcx instance, or a self-hosted Studio instance. (I haven't looked at shadow datasets or made any effort to understand what they are.)

I'd really prefer a UUID or anything equivalent that is backed by some randomness rather than a timestamp.

We have another problem here. The datasets are local to each team in `dvcx`, but "Projects" in Studio aren't. There could be projects in your private namespace, and some projects might be in a team's (or multiple teams') namespace. And you can have different datasets of the same name among different teams. Considering these federated/distributed hosts and multiple teams, I'd not trust the timestamp at all.

That said, I think it's best to ignore these things, unless dvcx starts taking them seriously. :)
I think it only makes sense for a remote/SaaS dataset. What's the point of versioning a local instance that could differ between users? If you want to pull a dataset locally, you can start with the remote dataset and track it with `dvc ds add`, then make that a dependency of a stage that uses the dvcx api to pull it locally.
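A sketch of that workflow (the dataset name and script are made up; `pull_dogs.py` would be a hypothetical script that calls the dvcx API):

```console
# track the remote dvcx dataset
dvc ds add --name dogs --url dvcx://dogs

# make it a dependency of a stage that pulls the data locally
dvc stage add -n pull-dogs -d ds://dogs python pull_dogs.py
```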
I guess my point is that there is a mismatch between how dvc and dvcx work. DVC will look at the SaaS for the dataset, but dvcx works based on a local index. So, a user might have a dataset `dogs` that was locally created, but dvc will try to look in the SaaS for it. The `DatasetQuery` runs locally based on that local dataset. So we are not versioning what might actually get used.
Also, AFAIK the plan is to release the CLI first and make some noise with it. So the user won't be able to use this integration unless we release the SaaS at a later date (maybe end of this year?).

It'd be nice if we could always track the dataset that is available locally (and only fall back to SaaS if it does not exist), and have provenance information about where the dataset originated so that we could link it back to the SaaS.
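A toy sketch of that "local first, SaaS fallback" resolution (every name here is hypothetical):

```python
from typing import Optional

# hypothetical stand-in for the local dvcx index
LOCAL_INDEX: dict[str, dict] = {"dogs": {"name": "dogs", "version": 3}}


def saas_lookup(name: str) -> Optional[dict]:
    return None  # placeholder: would query the SaaS catalog


def resolve_dataset(name: str) -> Optional[dict]:
    # prefer the locally available dataset; fall back to SaaS otherwise
    return LOCAL_INDEX.get(name) or saas_lookup(name)


print(resolve_dataset("dogs"))  # {'name': 'dogs', 'version': 3}
```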
I don't know how much you can do with a CLI version. It's fine if we want to only integrate with the SaaS. I just wanted to note that there is a difference in behavior between `dvc` and `dvcx`. I guess that could be easily fixed on the `dvcx` side via some API.
So the example in #10316 doesn't make much sense as is, since it will query the local instance, right?

There are a couple of ways the workflow could make sense:

- API methods to read from SaaS.
- API to clone the dataset from SaaS to local.
I find lists simpler and easier to use, with fewer indentations. But I am still undecided.
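For context, the two YAML shapes under discussion look roughly like this (the list form is what this PR writes; the mapping form keyed by name is the hypothetical alternative):

```yaml
# list form (this PR)
datasets:
- name: dogs
  url: dvcx://dogs
  type: dvcx
---
# hypothetical mapping form, keyed by name
datasets:
  dogs:
    url: dvcx://dogs
    type: dvcx
```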
Nice @skshetry! It works well for me in doing QA over the various types. I mostly have UI questions.
Do you have thoughts on trying to incorporate the path and rev into the url somehow, like this: …? It feels like we are now stuck needing both complex CLI options (…).

Other questions: …
I am not against supporting this. But I find this url to be very complex, and I doubt I'll remember it when I need it. Even if we do support both, I'd prefer keeping … Also, parsing these scp-style paths is difficult IIUC, as:

```diff
- [email protected]:iterative/example-get-started.git
+ ssh://[email protected]/iterative/example-get-started.git
```
yeah, both
I am not sure about the intended use case. It might be useful in Studio to show, from the metadata, whether the source changed between commits, but I am not sure how useful that information is. We could disable it for now.
This should just work without doing anything:

```console
dvc remote add myremote s3://bucketname/prefix
dvc ds add --url remote://myremote/path/to/file/or/dir --name name
```
Yup, it might be worse, especially if you think we should keep …

Does it include a hash in …?

Is …
It has an etag, but we don't have a hash of the content (eg: md5). To hash the content, you'd have to stream it from the remote.
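A hedged sketch of what hashing-by-streaming would involve (using fsspec, which DVC builds on; the function itself is illustrative):

```python
import hashlib

import fsspec


def md5_of_remote(url: str, chunk_size: int = 2**20) -> str:
    # stream the object from the remote and hash it chunk by chunk
    h = hashlib.md5()
    with fsspec.open(url, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# e.g. md5_of_remote("s3://cloud-versioning-demo/myproject/model.pt")
```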
Since the import is frozen on … And non-cloud-versioned remotes cannot be frozen at all, so this will require a different solution, in which we check the sources during …
I don't see a good reason to support non-versioned types like …
I'll disable non-cloud-versioned remotes/filesystems in subsequent PRs.
Can you drop them from the parser help text in this PR? I was focused on that more than actually disabling them. Edit: Or is there some case where …
I have dropped them from the help text.
* add support for tracking remote dataset
* remodeling
* remove non-cloud-versioned remotes
* improve deserializing; handle invalidation
This adds support for virtually tracking a dataset such as a dvcx dataset, a dataset from remote dvc/git registries, and cloud-versioned remotes.

This PR introduces two different commands under the `ds` namespace: `add` and `update`. The `dvc ds add` command adds the dataset to `dvc.yaml` and tracks the sources in the `dvc.lock` file. Similarly, `dvc ds update` updates the sources in the `dvc.lock` file.

Example Usage:

> [!WARNING]
> This `<schema>://` URI format was removed, and replaced with appropriate `--dvc`/`--dvcx`/`--url` flags in #10326.

To update, specify the name.

```console
dvc ds update <name>
```

`dvc ds add` freezes (in the traditional sense of `dvc import/import-url`). It keeps a "specification" of the dataset in `dvc.yaml` and also snapshots information about the dataset in the `dvc.lock` file. They are kept inside the `datasets` section in both `dvc.yaml` and `dvc.lock` files.

This metadata is used in the pipelines. You can add a dependency to your stage using the `ds://` scheme, followed by the name of the dataset in the `dvc.yaml` file. As it is used in pipelines, the `name` of the dataset has to be unique in the repository; different datasets with the same name are not allowed. On `dvc repro`, `dvc` copies the snapshot information about the particular dataset into the `deps` field for the stage in `dvc.lock`. On successive invocations, `dvc` will compare the information from the deps field of the lock with the snapshot information in the `datasets` section and decide whether to rerun or not.

As the dataset is frozen, `dvc repro` won't rerun until the dataset is updated via `dvc ds update`.

Here are some examples of what `dvc.yaml` looks like (see the full examples in the description above).

> [!NOTE]
> The `files` in the `datasets` section differs slightly compared to cloud-versioned outputs, as this separates `relpath` and metadata in `meta`, whereas `output` has them together.

The pipeline stages keep them in the `dataset` section of individual deps. When it is reproduced, the `dvc.lock` will look something like the examples shown above.

See also: