Allow passing a size-hint to s3.download_fileobj #3083
Comments
Hi @ronkorving, thanks for the feature request! I think this is reasonable. We'll leave this open to track interest for the time being, so if anyone is interested, please leave a reaction on the original post.
I had the same problem and I worked around it like this:

```python
from boto3.s3.transfer import (
    S3Transfer,
    BaseSubscriber,
    S3TransferRetriesExceededError,
    RetriesExceededError,
)


def download_file(transfer: S3Transfer, bucket_name: str, key_name: str,
                  key_size: int, download_path: str):
    """
    This is a workaround to provide the key size to the transfer routine
    and avoid a HEAD request for every file we download.
    See https://github.com/boto/boto3/issues/3083

    This is an override of the method 'download_file' from S3Transfer.
    """
    class ProvideKeySize(BaseSubscriber):
        def on_queued(self, future, **kwargs):
            # Hand the known object size to s3transfer so it skips head_object.
            future.meta.provide_transfer_size(key_size)

    future = transfer._manager.download(
        bucket_name, key_name, download_path, None, [ProvideKeySize()]
    )
    try:
        future.result()
    # This is for backwards compatibility where when retries are
    # exceeded we need to throw the same error from boto3 instead of
    # s3transfer's built in RetriesExceededError as current users are
    # catching the boto3 one instead of the s3transfer exception to do
    # their own retries.
    except S3TransferRetriesExceededError as e:
        raise RetriesExceededError(e.last_exception)
```

I'm re-defining `download_file` from `S3Transfer` here; I believe accepting a size hint in the official API would be the cleaner solution.
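For what it's worth, here is a minimal sketch of how that workaround might be driven from a list operation, assuming the `download_file` override above is in scope; the bucket name, prefix, and download directory are hypothetical:

```python
import boto3
from boto3.s3.transfer import S3Transfer

client = boto3.client("s3")
transfer = S3Transfer(client)

# Hypothetical bucket and prefix, for illustration only.
paginator = client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="reports/"):
    for obj in page.get("Contents", []):
        # The list response already includes each object's size,
        # so the override above can skip the per-object HEAD request.
        download_file(
            transfer,
            bucket_name="my-bucket",
            key_name=obj["Key"],
            key_size=obj["Size"],
            download_path="/tmp/" + obj["Key"].rsplit("/", 1)[-1],
        )
```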
@panthony I commend you for having found this workaround :) I hope AWS sees it as another confirmation that there's a real need to be addressed here.
Hi all, I reviewed this with the team today and they agreed that this seems reasonable and is likely something we'll implement if this gets significant traction. |
I just ran into this as well, downloading a long list of files from S3. I was surprised to find that there was always a HEAD request before the GET request in our service logs. I guess this explains it? I'm not sure why this is necessary at all, since the GET request already returns the size in the headers, which can be read before streaming the body. Is there a way to download using just a GET request?
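To illustrate that point, here's a minimal sketch of an unmanaged download that issues a single GET and reads the size from the response metadata before streaming the body; the bucket, key, and local path are hypothetical:

```python
import boto3

client = boto3.client("s3")

# Unmanaged download: one GET, no HEAD. ContentLength is available
# from the response metadata before the body is streamed.
response = client.get_object(Bucket="my-bucket", Key="reports/2021.csv")
size = response["ContentLength"]

with open("/tmp/2021.csv", "wb") as f:
    for chunk in response["Body"].iter_chunks(chunk_size=1024 * 1024):
        f.write(chunk)
```

The trade-off is that an unmanaged download gives up the transfer manager's multipart, concurrency, and retry handling.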
Dug a bit more and noticed that this behavior could be removed (and I'd argue that skipping the HEAD request should become the default).
**Is your feature request related to a problem? Please describe.**

When calling `download_fileobj`, the `transfer_future.meta.size` is unsurprisingly `None`, causing a `head_object` call to take place, only to fetch the size of the object. You can see this here: https://github.com/boto/s3transfer/blob/ccb71ddd89149a4bc5a45a2fcd5e42aafba3f0ea/s3transfer/download.py#L339-L348

In cases where you're calling `download_fileobj` on many files following a list operation -- which already provides object sizes in its response! -- this seems rather wasteful and measurably hurts overall latency. This is especially visible when dealing with many small objects.

I only looked into `download_fileobj`, but I can imagine the same applies to `download_file`, and possibly other scenarios. I understand that using unmanaged downloads would avoid this problem altogether, but I'm not really asking for workarounds.

**Describe the solution you'd like**

If `download_fileobj`, either via `TransferConfig` or, perhaps better, via `ExtraArgs`, would accept a "size hint" that we could provide following a list operation, that HEAD request could be avoided, latency would drop, and cost (Lambda execution time and S3 requests) would decrease.
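For illustration, this is roughly what the requested interface might look like from the caller's side. `SizeHint` is an invented `ExtraArgs` key, not part of boto3's actual API, so this snippet would be rejected by the current library:

```python
import boto3

client = boto3.client("s3")

listing = client.list_objects_v2(Bucket="my-bucket", Prefix="reports/")
for obj in listing.get("Contents", []):
    with open("/tmp/" + obj["Key"].rsplit("/", 1)[-1], "wb") as f:
        client.download_fileobj(
            Bucket="my-bucket",
            Key=obj["Key"],
            Fileobj=f,
            # Hypothetical: reuse the size from the list response so the
            # transfer manager can skip its head_object call.
            ExtraArgs={"SizeHint": obj["Size"]},
        )
```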