
Chunked file upload #9

Closed · mzur opened this issue May 11, 2022 · 0 comments · Fixed by #10

mzur (Member) commented May 11, 2022

I have now observed many instances where a file upload ran into the (already extended) 50 s script execution timeout. Passing a multi-GB file through the application and then uploading it to the S3 storage simply takes too long.

One alternative would be presigned upload URLs, but these do not allow file size or MIME type validation.

The only other way is chunked uploads, which require additional processing and application logic. Thoughts:

  • Files are sent in chunks if their size exceeds a certain threshold (e.g. 100 MB). Use a chunk size of 100 MB.
  • Chunked files get the additional request parameters chunk_index and chunk_total, specifying the index of the currently uploaded chunk and the total number of chunks of the file.
  • File chunks are stored in the pending storage disk at their "normal" path but with their chunk index as suffix, e.g. my/file.jpg.0, my/file.jpg.1.
  • To support validation of chunked files (i.e. max_file_size and whether all chunks of a file have been received), we need to introduce a new StorageRequestFile model with the attributes path, size, received_chunks and total_chunks.
  • Size validation: The size attribute of a file is increased with each uploaded chunk. If the size exceeds the max_file_size threshold (or the quota), the chunk is rejected with a failed validation and a queued job is dispatched to delete the existing chunks of the file.
  • Chunk validation: The received_chunks and total_chunks attributes tell whether all chunks of a file have been received. Reject the submission of a storage request if a file has missing chunks. Also reject the upload of chunks that were already received (e.g. an attempt to replace the first chunk with one that has a different MIME type).
  • MIME type validation: The first chunk of a file must be uploaded first. It is checked for a valid MIME type. No other chunk is accepted until this check has passed (see the upload sketch after this list).
  • When a storage request with chunked files is submitted, an AssembleChunkedFile job is first dispatched for each chunked file (see below). Once all files have been assembled, the review notification is sent to the admins.
  • The AssembleChunkedFile job merges the chunks of a file, uploads the complete file and deletes the chunks. It also clears the received_chunks and total_chunks attributes of the file (see the job sketch after this list).
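
To make the validation bullets concrete, here is a minimal sketch of a chunk upload endpoint, assuming a Laravel controller. The request parameters, the pending disk and the StorageRequestFile attributes follow the list above; the MIME type list, the config key and the DeleteFileChunks cleanup job are hypothetical, and received_chunks is modeled as an array of received chunk indices (cast to array on the model):

```php
use Illuminate\Http\Request;
use Illuminate\Validation\ValidationException;

class StorageRequestFileController extends Controller
{
    // Hypothetical list of accepted MIME types.
    const ALLOWED_MIMES = ['image/jpeg', 'image/png', 'image/tiff'];

    public function store(Request $request)
    {
        $request->validate([
            'file' => 'required|file',
            'path' => 'required|string',
            'chunk_index' => 'required|integer|min:0',
            'chunk_total' => 'required|integer|min:2',
        ]);

        $index = (int) $request->input('chunk_index');

        $file = StorageRequestFile::firstOrCreate(
            ['path' => $request->input('path')],
            [
                'size' => 0,
                'received_chunks' => [],
                'total_chunks' => (int) $request->input('chunk_total'),
            ]
        );

        // The first chunk must arrive first so its MIME type can be checked.
        if ($index !== 0 && !in_array(0, $file->received_chunks)) {
            throw ValidationException::withMessages([
                'chunk_index' => 'Chunk 0 must be uploaded first.',
            ]);
        }

        // Reject chunks that were already received (e.g. an attempt to
        // replace chunk 0 with one that has a different MIME type).
        if (in_array($index, $file->received_chunks)) {
            throw ValidationException::withMessages([
                'chunk_index' => 'This chunk was already received.',
            ]);
        }

        if ($index === 0 && !in_array($request->file('file')->getMimeType(), self::ALLOWED_MIMES)) {
            throw ValidationException::withMessages([
                'file' => 'Invalid MIME type.',
            ]);
        }

        // Size validation: the size attribute grows with each chunk.
        // The config key is an assumption.
        $newSize = $file->size + $request->file('file')->getSize();
        if ($newSize > config('user_storage.max_file_size')) {
            // Hypothetical queued job that deletes the existing chunks.
            DeleteFileChunks::dispatch($file);
            throw ValidationException::withMessages([
                'file' => 'The maximum file size was exceeded.',
            ]);
        }

        // Store the chunk at its "normal" path with the index as suffix,
        // e.g. my/file.jpg.1.
        $request->file('file')->storeAs(
            dirname($file->path),
            basename($file->path).'.'.$index,
            'pending'
        );

        $file->update([
            'size' => $newSize,
            'received_chunks' => array_merge($file->received_chunks, [$index]),
        ]);
    }
}
```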

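And a rough sketch of the AssembleChunkedFile job itself, assuming the chunks fit into a local temporary file during assembly (the real implementation may stream or assemble on the target disk directly):

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class AssembleChunkedFile implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public StorageRequestFile $file)
    {
    }

    public function handle()
    {
        $disk = Storage::disk('pending');

        // Concatenate the chunks in order into a local temporary file.
        $tmpPath = tempnam(sys_get_temp_dir(), 'assemble');
        $target = fopen($tmpPath, 'w+b');

        for ($i = 0; $i < $this->file->total_chunks; $i++) {
            $chunk = $disk->readStream($this->file->path.'.'.$i);
            stream_copy_to_stream($chunk, $target);
            fclose($chunk);
        }

        // Upload the complete file to its "normal" path, then delete the
        // chunks.
        rewind($target);
        $disk->writeStream($this->file->path, $target);
        fclose($target);
        unlink($tmpPath);

        for ($i = 0; $i < $this->file->total_chunks; $i++) {
            $disk->delete($this->file->path.'.'.$i);
        }

        // Clear the chunk bookkeeping of the file.
        $this->file->update(['received_chunks' => null, 'total_chunks' => null]);
    }
}
```
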
Bonus: We know the size of each file, which is required to implement #8. Also:

  • The total size of a storage request can be shown in the admin notification, on the review view and in the list of storage requests.
  • The used quota display can be updated immediately in the list of storage requests.
  • The used quota of a user can now be determined by summing the size of all their uploaded files in the DB (see the query sketch below).
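
For example, assuming a storage_requests table with a user_id column and the size attribute introduced above, the used quota could be a single aggregate query (table and column names are assumptions):

```php
// `size` is the StorageRequestFile attribute introduced above; the join
// through the parent storage request is an assumed table layout.
$usedQuota = StorageRequestFile::join('storage_requests', 'storage_request_files.storage_request_id', '=', 'storage_requests.id')
    ->where('storage_requests.user_id', $user->id)
    ->sum('storage_request_files.size');
```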

After implementation:

  • The increased script execution timeout can be reduced back to the default.
  • Update the multipart upload configuration of the S3 storage disk.
@mzur mzur moved this to High Priority in BIIGLE Roadmap May 11, 2022
@mzur mzur closed this as completed in #10 May 20, 2022