Streaming DVC imports #10232

Closed
dberenbaum opened this issue Jan 11, 2024 · 8 comments · Fixed by #10287
Labels: A: api (Related to the dvc.api), p1-important (Important, aka current backlog of things to do)

Comments

@dberenbaum
Collaborator

#10164 will introduce datasets as a new type of dependency that aren't based on the local filesystem. This same mechanism can be used to stream data from other DVC repos. Unlike dvc import, no local copy of the data is needed. Users can specify a revision, freeze it, make it a stage dependency, and stream it into their code using the DVC API.
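
As a rough illustration of what streaming from another DVC repo at a pinned revision can look like with the existing `dvc.api.open` call, here is a minimal sketch. The helper name `stream_lines` and its `opener` parameter are assumptions made for this example and are not part of DVC; the `opener` hook exists only so the helper can be exercised without DVC or network access.

```python
from typing import Callable, Iterator, Optional


def stream_lines(
    url: str,
    path: str,
    rev: Optional[str] = None,
    opener: Optional[Callable] = None,
) -> Iterator[str]:
    """Yield the lines of `path` from the DVC repo at `url`, pinned to `rev`."""
    if opener is None:
        # Imported lazily so the helper can be exercised without DVC installed.
        import dvc.api

        opener = dvc.api.open
    # The file is read through a file-like object; no copy of the data is
    # placed in the current workspace, unlike `dvc import`.
    with opener(path, repo=url, rev=rev) as f:
        for line in f:
            yield line
```

Iterating over the returned generator reads the file line by line, so the caller never needs the whole file in memory at once.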

dberenbaum added the p1-important and A: api labels on Jan 11, 2024
@shcheklein
Member

@dberenbaum

"and stream it into their code using the DVC API"

Could you clarify this, please? Which DVC API is going to be used?

@dberenbaum
Collaborator Author

dberenbaum commented Jan 11, 2024

That's still being worked out in #10164, but you can see examples there of how dvc.api.dataset may work, for DVCX at least. For streaming DVC imports, it could either return info like repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem.
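
To make the two options concrete, here is a minimal sketch of both shapes. `ResolvedImport`, `as_fs_kwargs`, and `ImportStream` are hypothetical names invented for this illustration and are not the API from #10164; the only real DVC call assumed is `DVCFileSystem(url=..., rev=...)`, imported lazily.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ResolvedImport:
    """Hypothetical record describing a frozen DVC import."""

    url: str   # Git URL of the source DVC repo
    rev: str   # pinned revision (commit hash, tag, or branch)
    path: str  # path of the data inside that repo


def _default_fs_factory(url: str, rev: str) -> Any:
    # Imported lazily so the sketch can be read and tested without DVC installed.
    from dvc.api import DVCFileSystem

    return DVCFileSystem(url=url, rev=rev)


# Option 1: return plain info and let the caller build the filesystem.
def as_fs_kwargs(resolved: ResolvedImport) -> dict:
    return {"url": resolved.url, "rev": resolved.rev}


# Option 2: a thin wrapper that owns the filesystem and only exposes open().
class ImportStream:
    def __init__(
        self,
        resolved: ResolvedImport,
        fs_factory: Callable[[str, str], Any] = _default_fs_factory,
    ) -> None:
        self._resolved = resolved
        self._fs = fs_factory(resolved.url, resolved.rev)

    def open(self, mode: str = "rb"):
        # Delegates to the fsspec-style open() of the underlying filesystem.
        return self._fs.open(self._resolved.path, mode)
```

Option 1 keeps the API surface small and leaves the choice of access tool to the user, while option 2 trades that flexibility for a shorter call site.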

@shcheklein
Member

Got it. I wonder if this is needed (vs. people using their own tools to access data the way they want). Should we just have a way to pass some information about the dependency to the user code? (I think this was also asked in some other contexts, btw.)

@dberenbaum
Collaborator Author

"should we just have a way to pass some information about the dependency to the user code?"

Sorry, I don't think I follow how that differs from the example in #10164 where dvc.api.dataset returns the dataset name and version?

@dberenbaum
Collaborator Author

Actually, I thought there was only a DVCX example in #10164, but there's also one for streaming DVC imports that looks like this:

from dvc.api.dataset import DVCDataset, get
from dvc.fs.dvc import DVCFileSystem

resolved = get(DVCDataset, "stackoverflow")
fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
with fs.open(resolved.path) as f:
    process_posts(f.readlines())

@shcheklein
Member

Okay, I see. I got confused by "For streaming DVC imports, it could either return info like repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem." But I see that this is just an example for the DVC-specific deps.

Good then, and it makes sense. The only potential thing to look into is whether this can be generalized with an API that provides info about the pipeline / deps in general.
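
One way to picture such a generalized API is a lookup that returns the recorded metadata for any named dependency and leaves the access method to the user code. This is only a sketch: `find_dependency`, the `deps` list, and its field names are hypothetical and hand-written for the example, not something DVC provides in this form.

```python
from typing import Mapping, Sequence


def find_dependency(
    deps: Sequence[Mapping[str, str]], name: str
) -> Mapping[str, str]:
    """Return the entry called `name`, whatever kind of dependency it is."""
    for dep in deps:
        if dep.get("name") == name:
            return dep
    raise KeyError(name)


# Hypothetical, hand-written metadata; a real API would load it from the project.
deps = [
    {
        "name": "stackoverflow",
        "type": "dvc",
        "url": "https://example.com/repo.git",
        "rev": "abc123",
        "path": "data/posts.xml",
    },
    {"name": "params", "type": "file", "path": "params.yaml"},
]

info = find_dependency(deps, "stackoverflow")
if info["type"] == "dvc":
    # User code decides how to access the data, e.g. by passing these
    # values to DVCFileSystem or to another tool of its choice.
    fs_kwargs = {"url": info["url"], "rev": info["rev"]}
```

The DVC-specific case then becomes one branch on the dependency type rather than a separate API.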

@dberenbaum
Collaborator Author

dberenbaum commented Jan 11, 2024

"The only potential thing to look into is whether this can be generalized with an API that provides info about the pipeline / deps in general."

Good point. Related to #10179. Maybe we can combine these APIs. cc @skshetry

dberenbaum added this to DVC on Jan 23, 2024
dberenbaum moved this to Todo in DVC on Jan 23, 2024
@dberenbaum
Collaborator Author

@skshetry Forgot that we already have this issue and #10231. Added both to the project board. It would be great to also get your thoughts on the API and whether it makes sense to combine it with #10179.

@skshetry skshetry linked a pull request Feb 23, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Feb 26, 2024

3 participants