Streaming DVC imports #10232
Comments
Could you clarify this, please? What DVC API is going to be used?
That's still being worked out in #10164, but you can see there examples of how a …
Got it. I wonder if this is needed (vs. people using their own tools to access data the way they want). Should we just have a way to pass some information about the dependency to the user code? (I think this was asked in some other contexts, by the way.)
Sorry, I don't think I follow how that differs from the example in #10164 where …
Actually, I thought there was only a DVCX example in #10164, but there's also one for streaming DVC imports that looks like this:

    from dvc.api.dataset import DVCDataset, get
    from dvc.fs.dvc import DVCFileSystem

    resolved = get(DVCDataset, "stackoverflow")
    fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
    with fs.open(resolved.path) as f:
        process_posts(f.readlines())
Okay, I see. I got confused by … Good then, and it makes sense. The only potential thing to look into is whether this can be generalized with an API that provides info about the pipeline / deps in general.
#10164 will introduce datasets as a new type of dependency that isn't based on the local filesystem. This same mechanism can be used to stream data from other DVC repos. Unlike `dvc import`, no local copy of the data is needed. Users can specify a revision, freeze it, make it a stage dependency, and stream it into their code using the DVC API.
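To make "stream it into their code" concrete, here is a minimal sketch that reads a file from another DVC repo at a pinned revision without keeping a local copy. It uses the already-released `dvc.api.open` rather than the dataset API still being designed in #10164. `ResolvedDataset`, `open_from_dvc`, and `iter_lines` are hypothetical names made up for illustration, and the URL, revision, and path are placeholders.

```python
import io
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class ResolvedDataset:
    """Hypothetical record of what a resolved dataset dependency might carry."""
    url: str   # Git URL of the source DVC repo
    rev: str   # pinned revision (branch, tag, or commit SHA)
    path: str  # path of the tracked file inside that repo


def open_from_dvc(ds: ResolvedDataset):
    """Open the file for streaming; no copy is placed in the workspace."""
    # Imported lazily so the rest of the sketch runs without DVC installed.
    import dvc.api

    return dvc.api.open(ds.path, repo=ds.url, rev=ds.rev, mode="r")


def iter_lines(ds: ResolvedDataset, opener: Callable = open_from_dvc) -> Iterator[str]:
    """Yield lines one at a time instead of loading the whole file."""
    with opener(ds) as f:
        for line in f:
            yield line.rstrip("\n")


# Offline check: swap in an in-memory file for the real repo (placeholder values).
sample = ResolvedDataset(
    url="https://example.com/org/repo.git", rev="main", path="data/posts.txt"
)
lines = list(iter_lines(sample, opener=lambda ds: io.StringIO("first\nsecond\n")))
```

Against a real repo, `iter_lines(sample)` would use the default opener and stream from the source repo's remote storage. Keeping the opener injectable is only a convenience for testing. It is not meant to suggest how #10164 will expose dependency information.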