Support streaming downloads #63
Open
ivgiuliani wants to merge 10 commits into master from stream-downloads
Conversation
ivgiuliani force-pushed the stream-downloads branch 3 times, most recently from dc6cdfe to 8d500d1 on April 12, 2023 at 08:51
Make sure integration tests always start with a clean slate, regardless of how the previous test run has ended. Note that this can also fail, but it's a good test in itself...
This introduces a new `.stream` interface for upload and download operations. Currently only downloads are supported, and as part of this commit there's only a reference implementation for the in-memory adapter.

Requirements considered:
- being able to easily access the content of the files in memory as we download them: this is so we can process the content as the file gets downloaded
- keeping the interface as similar as possible to the existing (non-stream-based) download/upload: this is to reduce cognitive load on the library, as the main change between stream/non-stream is to add the `.stream` keyword and iterate on blocks.
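As a rough sketch of what such a reference implementation could look like (illustrative only, not the PR's actual code: the `@buckets` layout, the constant, and the `Enumerator` shape are all assumptions):

```ruby
# Hypothetical in-memory adapter sketch. `@buckets` is assumed to map
# bucket names to { key => content } hashes.
DEFAULT_STREAM_CHUNK_SIZE_BYTES = 4 * 1024 * 1024 # 4MB default, per the PR description

def stream_download(bucket:, key:, chunk_size: nil)
  chunk_size ||= DEFAULT_STREAM_CHUNK_SIZE_BYTES
  content = @buckets.fetch(bucket).fetch(key)

  Enumerator.new do |yielder|
    offset = 0
    while offset < content.bytesize
      # Hand back at most `chunk_size` bytes at a time, so callers can
      # process the file as it "downloads".
      yielder << content.byteslice(offset, chunk_size)
      offset += chunk_size
    end
  end
end
```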
Adds support for streaming reads/downloads in the disk adapter.
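A comparable hedged sketch for a disk-backed adapter, assuming objects live at `<base_dir>/<bucket>/<key>` (an illustrative layout, not necessarily the adapter's real one):

```ruby
# Hypothetical disk adapter sketch; the on-disk layout and instance
# variables are assumptions for illustration.
def stream_download(bucket:, key:, chunk_size: 4 * 1024 * 1024)
  path = File.join(@base_dir, bucket, key)

  Enumerator.new do |yielder|
    File.open(path, "rb") do |file|
      # IO#read returns nil at EOF, which terminates the loop cleanly.
      while (chunk = file.read(chunk_size))
        yielder << chunk
      end
    end
  end
end
```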
ivgiuliani force-pushed the stream-downloads branch 2 times, most recently from e0866d8 to a204732 on April 12, 2023 at 09:56
Adds support for streaming reads/downloads in the GCS adapter. Note that GCS does _not_ natively support streaming downloads, so we emulate streaming using partial range downloads.
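As an illustration of this emulation (a sketch assuming the adapter uses the `google-cloud-storage` gem and its `range:` download option; not the PR's exact code):

```ruby
require "google/cloud/storage"

# Illustrative sketch only: emulate streaming by issuing one ranged
# download per chunk. Each range request is an independent HTTP call,
# so a failed chunk can be retried without restarting the stream.
def stream_gcs_download(bucket:, key:, chunk_size: 4 * 1024 * 1024)
  file = Google::Cloud::Storage.new.bucket(bucket).file(key)
  total = file.size.to_i

  Enumerator.new do |yielder|
    offset = 0
    while offset < total
      last = [offset + chunk_size, total].min - 1
      # `range:` asks GCS for just bytes offset..last of the object.
      io = file.download(range: offset..last)
      io.rewind # rewind before reading, as in the gem's own examples
      yielder << io.read
      offset = last + 1
    end
  end
end
```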
S3's Ruby SDK supports streaming downloads in two ways:

- by streaming directly into a file-like object (e.g. `StringIO`). However, due to how this is implemented, we can't access the content of the file in-memory as it gets downloaded. This is problematic as it breaks one of our requirements, since we may want to process the file as we download it.
- by using block-based iteration. This has the big drawback that it will _not_ retry failed requests after the first chunk of data has been yielded, which could lead to file corruption on the client end by starting over mid-stream. (See https://aws.amazon.com/blogs/developer/downloading-objects-from-amazon-s3-using-the-aws-sdk-for-ruby/ for more details.)

Therefore streaming downloads are implemented similarly to the GCS adapter, by leveraging partial range downloads. This adds an extra HTTP call to AWS to obtain the overall file size, but it is a more robust solution: it supports retries of individual chunks and also allows us to inspect the content of the file as we download it.
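A minimal sketch of that approach with `aws-sdk-s3` (illustrative only; the adapter's real code will differ):

```ruby
require "aws-sdk-s3"

# Illustrative sketch of range-based chunking on S3. The `head_object`
# call is the extra HTTP request that fetches the total size up front.
def stream_s3_download(bucket:, key:, chunk_size: 4 * 1024 * 1024)
  client = Aws::S3::Client.new
  total = client.head_object(bucket: bucket, key: key).content_length

  Enumerator.new do |yielder|
    offset = 0
    while offset < total
      last = [offset + chunk_size, total].min - 1
      # Each ranged GET is an ordinary request, so the SDK's normal
      # retry logic applies to every chunk independently.
      resp = client.get_object(bucket: bucket, key: key, range: "bytes=#{offset}-#{last}")
      yielder << resp.body.read
      offset = last + 1
    end
  end
end
```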
ivgiuliani force-pushed the stream-downloads branch from a204732 to f5ee1d2 on April 13, 2023 at 08:50
This also tests that it's possible to customise the chunk size on individual adapters.
ivgiuliani force-pushed the stream-downloads branch from f5ee1d2 to 6109311 on April 13, 2023 at 08:51
ahjmorton reviewed on Apr 18, 2023
Comment on lines +61 to +62:

```ruby
def stream_download(bucket:, key:, chunk_size: nil)
  chunk_size ||= DEFAULT_STREAM_CHUNK_SIZE_BYTES
```
Suggested change:

```diff
-def stream_download(bucket:, key:, chunk_size: nil)
-  chunk_size ||= DEFAULT_STREAM_CHUNK_SIZE_BYTES
+def stream_download(bucket:, key:, chunk_size: DEFAULT_STREAM_CHUNK_SIZE_BYTES)
```
minor / nitpick, just curious why we're not using the default arg functionality
I think it's a PEBKAC...
ahjmorton pushed a commit that referenced this pull request on Apr 18, 2023:
The interface for this will change to add a `stream` version, similar to #63. To prepare for that, we simplify the `upload` method.
ahjmorton pushed a commit that referenced this pull request on Apr 18, 2023:
Completely stolen from #63. The KeyStreamer expects uploads/downloads to be expressed in terms of IO operations. For example, a download is actually `download into this IO` and an upload is `upload the content from this file`. Uploads/downloads of Strings are considered a special case of this. However, streaming is distinct enough that we make it its own API.
This adds support for a streaming interface for uploads and downloads, though only downloads are supported at the moment.
Requirements considered:

- being able to easily access the content of files in memory as we download them, so we can process the content as the file gets downloaded
- keeping the interface as similar as possible to the existing (non-streaming) download/upload, to reduce cognitive load on the library
Neither GCS nor S3 natively supports streaming[1], so streaming is emulated using partial range downloads. The streaming interface is meant to be as compatible as possible with the non-streaming version:
Non-streamed download

This is the existing interface, no change there. It returns a hash with `:bucket`, `:key` and `:content`.
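For example, assuming an illustrative `store` object exposing this interface (the name is a stand-in, not taken from the PR):

```ruby
# Existing, non-streamed interface: the whole object is returned at once.
result = store.download(bucket: "my-bucket", key: "file.txt")

result[:bucket]  # => "my-bucket"
result[:key]     # => "file.txt"
result[:content] # => the full file content as a String
```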
Streamed download

The streaming version introduces a `.stream` for the `.download`/`.upload` methods. `.download` takes an optional `chunk_size` argument to control the size of the chunks (with 4MB being the default otherwise). `.upload` is not implemented as part of this PR, but `.download` returns an enumerable object where each element is a lazily downloaded chunk of at most `chunk_size`. The major difference in this case is that the content is returned outside of the main hash, as this makes it easier to iterate on individual chunks:
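One plausible shape for this (the exact element structure and the `store` handle are assumptions for illustration, not taken from the PR):

```ruby
# Hypothetical streamed usage. The PR states only that chunks are yielded
# lazily with the content kept outside the metadata hash; the
# [metadata, chunk] pairing below is an assumed shape.
store.stream.download(bucket: "my-bucket", key: "file.txt").each do |metadata, chunk|
  # metadata => { bucket: "my-bucket", key: "file.txt" }
  # chunk    => at most `chunk_size` bytes of the file's content
  process(chunk) # placeholder for per-chunk processing on the caller's side
end
```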
[1] Sort of. S3's SDK does, in fact, support streaming downloads. Moreover, there are two ways it does this, but neither works well enough for our use cases. The first is to stream directly to a file-like object (e.g. `StringIO`); however, due to how this is implemented, we can't access the content of the file in-memory as we download it. The second is to use blocks; however, when using blocks to download objects, the Ruby SDK will NOT retry failed requests after the first chunk of data has been yielded, as doing so could cause file corruption on the client end by starting over mid-stream. See https://aws.amazon.com/blogs/developer/downloading-objects-from-amazon-s3-using-the-aws-sdk-for-ruby/ for details.