Ability to chunk download from object store #274
I originally submitted this issue in the datafusion repo, which I think is the wrong repo. Quoted reply from @alamb:
It should be relatively straightforward to achieve this using buffer_ordered from the futures crate; we may just need to document how to do this.
Maybe it would make a good example.
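For illustration, a minimal sketch of what such an example could look like, assuming `ObjectStore::head` and `ObjectStore::get_range` plus `StreamExt::buffered` (the ordered counterpart of `buffer_unordered` in the futures crate). Offsets are `usize` here; recent object_store versions use `u64`, so the signatures may need adjusting:

```rust
use futures::stream::{self, StreamExt, TryStreamExt};
use object_store::{path::Path, ObjectStore};
use std::sync::Arc;

// Hypothetical helper: download an object in fixed-size ranges, fetching up to
// `concurrency` ranges at a time while `buffered` preserves their order.
async fn chunked_download(
    store: Arc<dyn ObjectStore>,
    path: &Path,
    chunk_size: usize,
    concurrency: usize,
) -> object_store::Result<Vec<u8>> {
    // One HEAD request to learn the total object size.
    let size = store.head(path).await?.size;

    // Split [0, size) into chunk_size-sized ranges.
    let ranges: Vec<_> = (0..size)
        .step_by(chunk_size)
        .map(|start| start..(start + chunk_size).min(size))
        .collect();

    // Issue the range requests with bounded concurrency; `buffered` yields
    // results in submission order, so concatenation is straightforward.
    let chunks: Vec<_> = stream::iter(ranges)
        .map(|range| {
            let store = Arc::clone(&store);
            let path = path.clone();
            async move { store.get_range(&path, range).await }
        })
        .buffered(concurrency)
        .try_collect()
        .await?;

    let mut out = Vec::with_capacity(size);
    for chunk in &chunks {
        out.extend_from_slice(chunk);
    }
    Ok(out)
}
```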
I can write an example. Using ...
But it's not obvious to me how to use the stream interface with ...
I was imagining that it would look something like making multiple calls to ...
A related discussion: ...
I believe @crepererum is working on something like this, called "chunked downloading".
I do. We have code for that at InfluxData and I plan to upstream this in the following order:
FWIW my preference would be to build this into the store implementations, e.g. into GetClient, as opposed to adding further wrapper types. I'd very much like to move away from wrapping things at the ObjectStore interface.
Edit: Actually my real preference would be to build this into something akin to the buffered interfaces as opposed to baking it into ObjectStore at all. This would allow for out-of-order chunking, avoid the issue of providing size and ETag information, and generally be far more flexible...
What do you mean by "buffered interfaces"? I mean a more general implementation sounds great, but if we have one that is implemented as an ...
I am referring to things like BufReader.
My understanding from Marco's comment is that we would need to use the extension mechanism in order to get the size (and possibly ETag) through to the wrapper. Given this already implies a non-standard invocation of the ObjectStore::get API by the caller, I don't really see the advantage over using a separate utility helper akin to BufReader in order to achieve this. We avoid overloading the ObjectStore interface, can return data out of order, and end up with a cleaner, more focused API.
TBC I am not suggesting an initial cut needs to implement all of the above, but that we should adopt an approach to this issue that allows for this down the line. Tbh the utility approach should be significantly simpler than an ObjectStore wrapper.
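For comparison, a minimal sketch of the BufReader-style utility approach, assuming the `BufReader` adapter in the crate's `buffered` module (present in recent object_store versions; exact names and signatures may differ):

```rust
use object_store::buffered::BufReader;
use object_store::{path::Path, ObjectStore};
use std::sync::Arc;
use tokio::io::AsyncReadExt;

// Sketch: wrap the store and object metadata in a BufReader-like utility and
// read through the standard AsyncRead interface, leaving the chunking and
// buffering policy to the adapter rather than to an ObjectStore wrapper.
async fn read_via_buf_reader(
    store: Arc<dyn ObjectStore>,
    path: &Path,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // The adapter needs the object metadata (size, ETag) up front.
    let meta = store.head(path).await?;

    // Capacity (here 8 MiB) controls how much is fetched per request.
    let mut reader = BufReader::with_capacity(store, &meta, 8 * 1024 * 1024);

    let mut buf = Vec::new();
    reader.read_to_end(&mut buf).await?;
    Ok(buf)
}
```

Note this keeps the ObjectStore trait untouched: the caller opts into buffering by constructing the reader, and size/ETag come from an explicit HEAD rather than an extension mechanism.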
The utility approach certainly looked nice with an alternate tokio runtime.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When downloading large objects (> 300 MB) using the object_store crate, I often hit a timeout with the default configuration (30-second connection timeout). Interestingly, when increasing the timeout, the download speed is actually lower (not sure if it's the same for everyone?).
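For context, the defaults can already be raised through ClientOptions; a minimal sketch assuming the S3 backend (builder and option names may vary by object_store version, and the bucket name is a placeholder):

```rust
use object_store::aws::{AmazonS3, AmazonS3Builder};
use object_store::ClientOptions;
use std::time::Duration;

// Sketch: raise the connect and per-request timeouts from their defaults
// before building the store.
fn build_store() -> object_store::Result<AmazonS3> {
    let options = ClientOptions::new()
        .with_connect_timeout(Duration::from_secs(30))
        .with_timeout(Duration::from_secs(300));

    AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket")
        .with_client_options(options)
        .build()
}
```

That said, a single long-running GET still uses one connection, which is what motivates the chunked parallel download described below.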
Describe the solution you'd like
I am wondering whether it makes sense to chunk a file into smaller ranges (say, 100 MB each), download each range in parallel over a different connection, and reconcile them under the same interface.
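A hypothetical sketch of that reconciliation, using buffer_unordered so a slow range doesn't stall faster ones, and writing each chunk at its offset in a preallocated buffer (again assuming usize offsets; recent versions use u64):

```rust
use futures::{pin_mut, stream::{self, StreamExt}};
use object_store::{path::Path, ObjectStore};
use std::sync::Arc;

// Sketch: fetch ranges out of order and place each chunk at its offset,
// reconciling the pieces into a single contiguous buffer.
async fn parallel_get(
    store: Arc<dyn ObjectStore>,
    path: &Path,
    chunk_size: usize,
    concurrency: usize,
) -> object_store::Result<Vec<u8>> {
    let size = store.head(path).await?.size;
    let mut out = vec![0u8; size];

    let results = stream::iter((0..size).step_by(chunk_size))
        .map(|start| {
            let store = Arc::clone(&store);
            let path = path.clone();
            let end = (start + chunk_size).min(size);
            // Tag each chunk with its starting offset so it can be placed later.
            async move { store.get_range(&path, start..end).await.map(|b| (start, b)) }
        })
        .buffer_unordered(concurrency);
    pin_mut!(results);

    while let Some(result) = results.next().await {
        let (start, bytes) = result?;
        out[start..start + bytes.len()].copy_from_slice(&bytes);
    }
    Ok(out)
}
```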
Describe alternatives you've considered
Not sure if such a capability can be composed using the existing interfaces.
Additional context