
Chunked file upload #9

Closed · mzur opened this issue May 11, 2022 · 0 comments · Fixed by #10

mzur (Member) commented May 11, 2022

I have now observed many instances where a file upload ran into the (already extended) 50 s script execution timeout. Passing a multi-GB file through the application and then uploading it to the S3 storage simply takes too long.

One alternative would be presigned upload URLs, but these do not allow file size or MIME type validation.

The only other way is chunked uploads, which require additional processing and application logic. Thoughts:

  • Files are sent in chunks if their size exceeds a certain threshold (e.g. 100 MB). Use a chunk size of 100 MB.
  • Chunked files get the additional request parameters chunk_index and chunk_total, specifying the index of the currently uploaded chunk and the total number of chunks of the file.
  • File chunks are stored in the pending storage disk at their "normal" path but with their chunk index as suffix, e.g. my/file.jpg.0, my/file.jpg.1.
  • To support validation of chunked files (i.e. max_file_size and whether all chunks of a file have been received), we need to introduce a new StorageRequestFile model with the attributes path, size, received_chunks and total_chunks.
  • Size validation: The size attribute of a file is increased with each uploaded chunk. If the size exceeds the max_file_size threshold (or the quota), the chunk is rejected with a failed validation and a queued job is dispatched to delete the existing chunks of the file.
  • Chunk validation: The received_chunks and total_chunks attributes tell whether all chunks of a file have been received. Reject the submission of a storage request if a file has missing chunks. Also reject the upload of chunks that were already received (e.g. an attempt to replace the first chunk with one that has a different MIME type).
  • MIME type validation: The first chunk of a file must be uploaded first. It is checked for a valid MIME type. No other chunk is accepted until this check has passed (see the upload sketch after this list).
  • When a storage request with chunked files is submitted, an AssembleChunkedFile job is first dispatched for each chunked file (see below). Once all files have been assembled, the review notification is sent to the admins.
  • The AssembleChunkedFile job merges the chunks of a file, uploads the complete file and deletes the chunks. It also clears the received_chunks and total_chunks attributes of the file (see the job sketch after this list).
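
To make the validation bullets concrete, here is a minimal sketch of a chunk upload endpoint, assuming a Laravel controller. The request parameters, the pending disk and the StorageRequestFile attributes follow the list above; the MIME type list, the config key and the DeleteFileChunks cleanup job are hypothetical, and received_chunks is modeled as an array of received chunk indices (cast to array on the model):

```php
use Illuminate\Http\Request;
use Illuminate\Validation\ValidationException;

class StorageRequestFileController extends Controller
{
    // Hypothetical list of accepted MIME types.
    const ALLOWED_MIMES = ['image/jpeg', 'image/png', 'image/tiff'];

    public function store(Request $request)
    {
        $request->validate([
            'file' => 'required|file',
            'path' => 'required|string',
            'chunk_index' => 'required|integer|min:0',
            'chunk_total' => 'required|integer|min:2',
        ]);

        $index = (int) $request->input('chunk_index');

        $file = StorageRequestFile::firstOrCreate(
            ['path' => $request->input('path')],
            [
                'size' => 0,
                'received_chunks' => [],
                'total_chunks' => (int) $request->input('chunk_total'),
            ]
        );

        // The first chunk must arrive first so its MIME type can be checked.
        if ($index !== 0 && !in_array(0, $file->received_chunks)) {
            throw ValidationException::withMessages([
                'chunk_index' => 'Chunk 0 must be uploaded first.',
            ]);
        }

        // Reject chunks that were already received (e.g. an attempt to
        // replace chunk 0 with one that has a different MIME type).
        if (in_array($index, $file->received_chunks)) {
            throw ValidationException::withMessages([
                'chunk_index' => 'This chunk was already received.',
            ]);
        }

        if ($index === 0 && !in_array($request->file('file')->getMimeType(), self::ALLOWED_MIMES)) {
            throw ValidationException::withMessages([
                'file' => 'Invalid MIME type.',
            ]);
        }

        // Size validation: the size attribute grows with each chunk.
        // The config key is an assumption.
        $newSize = $file->size + $request->file('file')->getSize();
        if ($newSize > config('user_storage.max_file_size')) {
            // Hypothetical queued job that deletes the existing chunks.
            DeleteFileChunks::dispatch($file);
            throw ValidationException::withMessages([
                'file' => 'The maximum file size was exceeded.',
            ]);
        }

        // Store the chunk at its "normal" path with the index as suffix,
        // e.g. my/file.jpg.1.
        $request->file('file')->storeAs(
            dirname($file->path),
            basename($file->path).'.'.$index,
            'pending'
        );

        $file->update([
            'size' => $newSize,
            'received_chunks' => array_merge($file->received_chunks, [$index]),
        ]);
    }
}
```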

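And a rough sketch of the AssembleChunkedFile job itself, assuming the chunks fit into a local temporary file during assembly (the real implementation may stream or assemble on the target disk directly):

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class AssembleChunkedFile implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public StorageRequestFile $file)
    {
    }

    public function handle()
    {
        $disk = Storage::disk('pending');

        // Concatenate the chunks in order into a local temporary file.
        $tmpPath = tempnam(sys_get_temp_dir(), 'assemble');
        $target = fopen($tmpPath, 'w+b');

        for ($i = 0; $i < $this->file->total_chunks; $i++) {
            $chunk = $disk->readStream($this->file->path.'.'.$i);
            stream_copy_to_stream($chunk, $target);
            fclose($chunk);
        }

        // Upload the complete file to its "normal" path, then delete the
        // chunks.
        rewind($target);
        $disk->writeStream($this->file->path, $target);
        fclose($target);
        unlink($tmpPath);

        for ($i = 0; $i < $this->file->total_chunks; $i++) {
            $disk->delete($this->file->path.'.'.$i);
        }

        // Clear the chunk bookkeeping of the file.
        $this->file->update(['received_chunks' => null, 'total_chunks' => null]);
    }
}
```
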
Bonus: We know the size of each file, which is required to implement #8. Also:

  • The total size of a storage request can be shown in the admin notification, on the review view and in the list of storage requests.
  • The used quota display can be updated immediately in the list of storage requests.
  • The used quota of a user can now be determined by summing the size of all their uploaded files in the DB (see the query sketch below).
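
For example, assuming a storage_requests table with a user_id column and the size attribute introduced above, the used quota could be a single aggregate query (table and column names are assumptions):

```php
// `size` is the StorageRequestFile attribute introduced above; the join
// through the parent storage request is an assumed table layout.
$usedQuota = StorageRequestFile::join('storage_requests', 'storage_request_files.storage_request_id', '=', 'storage_requests.id')
    ->where('storage_requests.user_id', $user->id)
    ->sum('storage_request_files.size');
```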

After implementation:

  • The increased script execution timeout can be reduced back to the default.
  • Update the multipart upload configuration of the S3 storage disk.
@mzur mzur moved this to High Priority in BIIGLE Roadmap May 11, 2022
@mzur mzur closed this as completed in #10 May 20, 2022