DVCX datasets #10114
Does the streaming dataset case affect us in any way? I assume that'll be in the user's code rather than in dvc (and cached by dvcx). Is that just dependency support without materializing? |
Okay, looks like for streaming datasets, we will need some APIs to redirect a tracked dataset in dvc to the appropriate dataset in dql. |
Discussed with @dmpetrov about this. He does not want dvc to compute md5 hashes for pipeline checksums, and wants that to be handled by dql itself - as in, ask dql whether the dataset has changed. This would require us to decouple the checksums that we use for pipelines from the md5 hash that we use for data management. There is also an open question of how this would work in practice. It would be a matter of figuring out how the API will work, and how the translation works for something like:

```console
dvc stage add -d ds://dogs@v1 ...
```

(For this, we'll need to figure out how the hypothetical `ds://` scheme gets resolved.) I think we should do a single integration (or start with just one). Given the challenges with the former, maybe it's better to start with the latter one. |
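To make the translation question concrete, here is a minimal sketch of splitting such a URL into the name and version that dql would need. The `ds://<name>[@<version>]` scheme and the helper are illustrative assumptions, not an existing DVC API:

```python
from urllib.parse import urlparse

def parse_dataset_url(url: str):
    """Split a hypothetical ds://<name>[@<version>] URL into (name, version)."""
    parsed = urlparse(url)
    if parsed.scheme != "ds":
        raise ValueError(f"not a dataset URL: {url}")
    # "dogs@v1" -> ("dogs", "v1"); "dogs" -> ("dogs", None)
    name, _, version = parsed.netloc.partition("@")
    return name, version or None

assert parse_dataset_url("ds://dogs@v1") == ("dogs", "v1")
assert parse_dataset_url("ds://dogs") == ("dogs", None)
```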
Thinking about another alternative, maybe we should create a top-level `datasets` section in dvc.yaml. Eg:

```console
$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...
```

```python
from dvc.api import dataset

dataset.get("ds://dogs")  # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml
```

Passing data through envvars won't work if you want to let users do this. (We'll need to know if it changed or not to rerun stages.) |
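A rough sketch of how that redirection could work, assuming the hypothetical `datasets` section above; `resolve` is an illustrative helper, not an existing DVC API:

```python
import yaml  # PyYAML

def resolve(url: str, dvcyaml_path: str = "dvc.yaml") -> str:
    """Redirect an unpinned ds://<name> to the pinned entry in dvc.yaml's `datasets`."""
    with open(dvcyaml_path) as f:
        registered = yaml.safe_load(f).get("datasets") or []
    for entry in registered:
        # exact match, or an unpinned URL matching a pinned "<url>@<version>" entry
        if entry == url or entry.startswith(url + "@"):
            return entry  # e.g. "ds://dogs" -> "ds://dogs@v2"
    raise KeyError(f"dataset {url!r} is not registered in {dvcyaml_path}")
```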
Just sent an invite for the 3 of us to discuss next week.
So these are the possible approaches I see for accessing DVCX datasets. Does it look correct to you two? This corresponds to phase 1 and phase 2 in Notion. @skshetry Are you saying we should focus on approach 2? |
In the second diagram, dvc will have to act as an intermediary. The dataset info needs to be tracked in dvc's metadata, not in the source code, so that when users update the metadata (via Studio or through the CLI), dvc will retrigger the stages that changed. At the same time, the dataset being referenced in source code should change to the newer dataset and should be used for training. Users could update it in both places - in source code and in metadata - but making this as close to one-click as possible, in Studio and in the CLI, and being able to retrigger your pipelines and run your changes, seems to be the feature that we want here. |
I am saying that we should start with one integration first. Regarding the first approach, there is a strict requirement not to touch the dataset itself, and there are also technical challenges. Given this, I am not sure if dvc even needs to be involved. Can users use dql to materialize it? Can it be just a dependency or an external dependency from dvc's point of view, making dql the command to manipulate/checkout/pull datasets? |
Thanks @skshetry. I agree that starting with approach 2 makes sense, just wanted to clarify that's what you meant. I also agree DVC needs awareness of the dataset, but I'm not convinced we need a top-level `datasets` section.

Within approach 2, I see two different scenarios:

a. Fixed versions -- I want to specify the exact version every time.
b. Floating versions -- I don't want to specify the version in code; DVC should snapshot whatever is current.

2a looks simple to me. Use the DVCX API directly in your code and parametrize it with DVC:

```yaml
# params.yaml
dataset:
  name: dogs
  version: 2
```

```yaml
# dvc.yaml
stages:
  train:
    cmd: python train.py ${dataset}
```

```python
ds = DatasetQuery(name=name, version=version)
```

2b is where we need DVC to snapshot the version, but only because the user doesn't want to specify it in code, so it can look like this:

```yaml
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

```yaml
# dvc.lock
train:
  cmd: python train.py
  deps:
  - path: ds://dogs
    version: v2
```

```python
ds = DatasetQuery(name="dogs")
```

In 2a, updating the dataset can be done by changing the parameters. In 2b, the update will happen automatically.
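For 2a, the training script could look roughly like this; `dvc.api.params_show` is an existing DVC API, and the `DatasetQuery` call follows the snippet above:

```python
# train.py -- sketch of scenario 2a: the dataset version lives in params.yaml,
# so changing it there is what makes DVC rerun the stage.
from dql.query import DatasetQuery
from dvc.api import params_show

params = params_show()["dataset"]  # {"name": "dogs", "version": 2}
ds = DatasetQuery(name=params["name"], version=params["version"])
```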
I would prefer not to because:
|
Maybe I am over-optimizing here and this is unnecessary, but dependencies can point to multiple dataset versions. That's what I was trying to avoid with this registration:

```yaml
train:
  cmd: python train.py
  deps:
  - path: ds://dogs
    version: v2
  - path: ds://dogs
    version: v1
```

Secondly, they are stage-local. You can have a stage with one version of a dataset and a separate stage with a different version. We could easily forbid this though (at least within one dvc.yaml).

Thirdly, you'll have to update in multiple places rather than just one. But that's a minor issue. |
@dberenbaum, how would dvc and Studio understand which dataset is being used in 2a? |
Also, do you think 2b is important? Why don't we push for pinning the dataset, for stronger reproducibility? Users can update dvc.yaml when they want to update (and it could be one-click from Studio if needed). |
Okay, I think you've convinced me on having a top-level `datasets` section. |
Adding some additional thoughts. Maybe you thought of some of these but didn't bother to detail them yet.

Maybe we can generalize this more? We might want to reuse it for other external data sources:

```yaml
# dvc.yaml
dataset:
  dogs:
    type: dvcx
    name: dogs
    version: v2
stages:
  train:
    deps:
    - dataset: dogs
```

Also wondering how it should look in the Python API. Can it be something like this?

```python
from dql.query import DatasetQuery
from dvc.api import dataset

ds = DatasetQuery(**dataset.get("dogs"))
```

As DVCX adds more methods and ways to stream data (like PyTorch data loaders, etc.), we don't want to have to add more DVC API methods to support every DVCX way of loading data. Users can keep their native DVCX code and only use DVC to provide the dataset info. |
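A toy sketch of what such a `dataset.get` could return under the hood, assuming the dvc.yaml registry shown above (a hypothetical API, not something DVC ships today):

```python
import yaml  # PyYAML

def get(alias: str, dvcyaml_path: str = "dvc.yaml") -> dict:
    """Look up a registered dataset and return kwargs for the native loader,
    e.g. DatasetQuery(**get("dogs"))."""
    with open(dvcyaml_path) as f:
        registry = yaml.safe_load(f)["dataset"]
    info = dict(registry[alias])
    info.pop("type", None)  # `type: dvcx` selects the backend; it isn't a query kwarg
    return info  # e.g. {"name": "dogs", "version": "v2"}
```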
@dberenbaum looks great!
Yes. It's related - you ask for update status using an external tool (DVCX, DBT, or custom code). IDK how much the generalization is needed at this point.
I guess a UUID or timestamp of the dataset. This is needed so as not to mix up dogs@v2 with a dogs@v2 that was created and deleted last week. Later DVCX will introduce a proper fingerprint, but that is a whole separate story. |
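A tiny illustration of why (name, version) alone is ambiguous; the field names here are hypothetical:

```python
# The first dogs@v2 was deleted last week; a new one was re-created under the
# same label. By (name, version) they collide; a UUID keeps their identities apart.
ds_deleted = {"name": "dogs", "version": "v2", "uuid": "7f3a-..."}
ds_current = {"name": "dogs", "version": "v2", "uuid": "91cd-..."}

assert (ds_deleted["name"], ds_deleted["version"]) == (ds_current["name"], ds_current["version"])
assert ds_deleted["uuid"] != ds_current["uuid"]
```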
Related to #982. Is it important when creating and deleting a dataset that you can reuse the same version numbers? If we have a UUID or timestamp, can DVC still get the deleted dataset to reproduce (assuming the raw storage hasn't changed)? What is the point of version numbers if we also plan to have UUID/timestamp? |
Name and version are human-readable. It would be the same as when dvc imports a file from a git repository: the file name and tags are nice, but it's better to use a sha for a proper snapshot. |