Commit
Implement stream downloads for the S3 adapter
S3's Ruby SDK supports streaming downloads in two ways:

- by streaming directly into a file-like object (e.g. `StringIO`). However, due to how this is implemented, we can't access the content of the file in memory as it gets downloaded. This is a problem because in-flight access is one of our requirements: we may want to process the file as we download it.
- by using block-based iteration. This has the big drawback that it will _not_ retry failed requests after the first chunk of data has been yielded, since starting over mid-stream could lead to file corruption on the client end. (See https://aws.amazon.com/blogs/developer/downloading-objects-from-amazon-s3-using-the-aws-sdk-for-ruby/ for more details.)

Therefore streaming downloads are implemented similarly to the GCS adapter, by leveraging partial range downloads. This adds an extra HTTP call to AWS to obtain the overall file size, but it is a more robust solution: it supports retries of individual chunks and also allows us to inspect the content of the file as we download it.
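The range-based approach can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the helper name `each_chunk` and the default `chunk_size` are hypothetical, and `client` is only assumed to respond to `head_object` and `get_object` (with a `range:` option) the way `Aws::S3::Client` does.

```ruby
# Hypothetical helper: yields an object's content chunk by chunk using
# ranged GET requests. `client` is assumed to behave like Aws::S3::Client.
def each_chunk(client, bucket:, key:, chunk_size: 5 * 1024 * 1024)
  unless block_given?
    return enum_for(:each_chunk, client, bucket: bucket, key: key, chunk_size: chunk_size)
  end

  # The extra HTTP call: fetch the total size so we know which ranges to request.
  size = client.head_object(bucket: bucket, key: key).content_length
  offset = 0

  while offset < size
    # HTTP byte ranges are inclusive on both ends.
    last = [offset + chunk_size, size].min - 1
    # Each chunk is its own request, so a failure can be retried in isolation.
    resp = client.get_object(bucket: bucket, key: key, range: "bytes=#{offset}-#{last}")
    yield resp.body.read
    offset = last + 1
  end
end
```

Because every chunk is fetched by an independent request, a failed chunk can be retried without restarting the whole download, and the caller sees each chunk's bytes as soon as they arrive. With the real SDK, the call would look something like `each_chunk(Aws::S3::Client.new, bucket: "some-bucket", key: "some/key") { |chunk| ... }`, where the bucket and key are placeholders.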