Firstly, thanks! S3-tar has been really useful for archiving some of our buckets.
We have a number of buckets that contain a mix of small files and files that are larger than 50% of available RAM. When a large file is encountered, the process is killed with an out-of-memory error. It'd be really great to resolve this.
From a cursory look at the code, the underlying cause seems to be the use of io.BytesIO() for in-memory buffering both when downloading from S3 and when building parts of the tar, meaning that processing any file needs RAM greater than roughly 2 × its size.
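To make the 2× concrete, here is a rough sketch of that pattern (this is not s3-tar's actual code, just my understanding of it; the bucket and key are placeholders):

```python
import io
import tarfile

import boto3

s3 = boto3.client("s3")

def tar_object_in_memory(bucket: str, key: str) -> io.BytesIO:
    # First copy: the whole object body is read into RAM.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    source_buf = io.BytesIO(body)

    # Second copy: the same bytes are written into an in-memory tar part,
    # so peak RAM is roughly 2x the object size.
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        info = tarfile.TarInfo(name=key)
        info.size = len(body)
        tar.addfile(info, source_buf)
    tar_buf.seek(0)
    return tar_buf
```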
I think multiple algorithms are required depending on file size. For small files the current process makes sense, as in-memory caching reduces the total time needed.
However, when a large file is encountered, it is probably necessary to pipe directly from the S3 stream through tar and back to S3, as in the sketch below. This would limit uploads to one part at a time.
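A rough sketch of what I mean, assuming boto3 and a hypothetical MultipartSink that flushes one multipart part at a time; the part size, bucket, and key names are placeholders, not anything from s3-tar:

```python
import tarfile

import boto3

s3 = boto3.client("s3")
PART_SIZE = 64 * 1024 * 1024  # hypothetical multipart part size

class MultipartSink:
    """File-like sink: each PART_SIZE chunk of the tar stream is flushed
    as one multipart-upload part, so only a single part sits in RAM."""

    def __init__(self, bucket, key):
        self.bucket, self.key = bucket, key
        self.upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
        self.parts, self.buf = [], bytearray()

    def write(self, data):
        self.buf.extend(data)
        if len(self.buf) >= PART_SIZE:
            self._flush()
        return len(data)

    def _flush(self):
        part_no = len(self.parts) + 1
        resp = s3.upload_part(
            Bucket=self.bucket, Key=self.key,
            UploadId=self.upload["UploadId"],
            PartNumber=part_no, Body=bytes(self.buf),
        )
        self.parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
        self.buf.clear()

    def close(self):
        if self.buf:
            self._flush()
        s3.complete_multipart_upload(
            Bucket=self.bucket, Key=self.key,
            UploadId=self.upload["UploadId"],
            MultipartUpload={"Parts": self.parts},
        )

def stream_object_into_tar(src_bucket, key, tar):
    # Stream the object body straight into the tar writer in fixed-size
    # chunks; no full copy of the file is ever held in RAM.
    obj = s3.get_object(Bucket=src_bucket, Key=key)
    info = tarfile.TarInfo(name=key)
    info.size = obj["ContentLength"]
    tar.addfile(info, obj["Body"])

# Usage sketch (bucket/key names are hypothetical):
sink = MultipartSink("dest-bucket", "archive.tar")
with tarfile.open(fileobj=sink, mode="w|") as tar:
    stream_object_into_tar("src-bucket", "big/object.bin", tar)
sink.close()
```

Because the tar stream is produced sequentially, parts have to be uploaded in order, which is where the one-part-at-a-time limit comes from.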
Alternatively, some intelligent spooling to disk could be used, although this has the analogous limitation that the maximum supported file size is bounded by available disk space rather than RAM.
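A minimal sketch of the spooling idea, using Python's tempfile.SpooledTemporaryFile so that only objects over a threshold touch the disk (the threshold here is an arbitrary illustration):

```python
import tempfile

import boto3

s3 = boto3.client("s3")

def download_with_spool(bucket, key, threshold=100 * 1024 * 1024):
    # Small objects stay in RAM; anything over the threshold rolls over
    # to a temporary file on disk automatically.
    buf = tempfile.SpooledTemporaryFile(max_size=threshold)
    s3.download_fileobj(bucket, key, buf)
    buf.seek(0)
    return buf
```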