Replies: 7 comments
-
I don't have a solution, but I'd like to leave a note. A zip archive is a non-solid compression archive: each file is compressed into its own blocks. A 7z archive is a solid archive by default: all data blocks are concatenated and compressed into a single solid block (https://en.wikipedia.org/wiki/Solid_compression). If you want a file object targeting an archived file, py7zr needs to extract everything stored before the specified file in order to reach the state just before extracting the target file.
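For context, a minimal sketch of what getting a file object looks like with py7zr's regular API, assuming `SevenZipFile.read(targets=...)` returns a dict mapping archived names to `io.BytesIO` objects (the archive path and member name below are placeholders). Because of the solid block, this still decompresses everything stored before the requested member:

```python
import py7zr

# Placeholder archive path and member name.
with py7zr.SevenZipFile("test.7z", "r") as z:
    extracted = z.read(targets=["docs/readme.txt"])
    for name, bio in extracted.items():
        data = bio.read()  # the member is fully decompressed into memory
        print(name, len(data), "bytes")
```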
-
Thank you for the answer 😃
-
You are right.
-
Thank you. That's all I wanted to know. 🤓
-
I wanted to read a huge 7z file as well. The following seems to work if you can process the data in chunks:

```python
import py7zr

path = "test.7z"

with py7zr.SevenZipFile(path, "r") as z:
    for f in z.files:
        if f.is_directory:
            continue
        folder = f.folder
        decompressor = folder.get_decompressor(f.compressed)
        remaining = f.uncompressed
        while remaining > 0:
            chunk = decompressor.decompress(z.fp, remaining)
            # Do something with "chunk" of "f.filename" here,
            # like send it to a server or something.
            remaining -= len(chunk)
            if remaining <= 0:
                break
            else:
                print(f"decompressing: {remaining * 1e-6:16.6f} MB of {f.filename} remaining")
```

Alternatively, many APIs are happy with a file object that only supports `read()`:

```python
import os
import shutil

import py7zr

path = "test.7z"


class ReadOnlyFile:
    def __init__(self, f, z):
        folder = f.folder
        self.decompressor = folder.get_decompressor(f.compressed)
        self.remaining = f.uncompressed
        self.z = z
        self.chunk = b""
        self.offset = 0

    def read(self, size=-1):
        if self.remaining <= 0:
            return b""
        # Buffer a new chunk if the current chunk is exhausted.
        if self.offset >= len(self.chunk):
            self.chunk = self.decompressor.decompress(self.z.fp, self.remaining)
            self.offset = 0
        if size < 0:
            # Return the rest of the buffered chunk if the caller does not care about size.
            chunk = self.chunk[self.offset:]
            self.chunk = b""
            self.offset = 0
            self.remaining -= len(chunk)
        else:
            # Return as much of the chunk as is available.
            available = min(size, len(self.chunk) - self.offset)
            chunk = self.chunk[self.offset : self.offset + available]
            self.offset += available
            self.remaining -= available
        return chunk


# Example: unpack all files in a 7z archive.
with py7zr.SevenZipFile(path, "r") as z:
    for f in z.files:
        if f.is_directory:
            continue
        src = ReadOnlyFile(f, z)
        # Guard against path traversal in archived file names.
        assert ".." not in f.filename
        d = os.path.dirname(f.filename)
        if d:
            os.makedirs(d, exist_ok=True)
        # Do something with the src file object, e.g. copy it to an output file.
        with open(f.filename, "wb") as dst:
            shutil.copyfileobj(src, dst)
```
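As another illustration of the "only supports `read()`" point, a small sketch (assuming the `ReadOnlyFile` class defined above) that hashes each archived member without ever materialising it on disk:

```python
import hashlib

import py7zr

with py7zr.SevenZipFile("test.7z", "r") as z:
    for f in z.files:
        if f.is_directory:
            continue
        src = ReadOnlyFile(f, z)
        digest = hashlib.sha256()
        while True:
            chunk = src.read(1 << 20)  # read roughly 1 MiB at a time
            if not chunk:
                break
            digest.update(chunk)
        print(f.filename, digest.hexdigest())
```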
-
@99991 Unfortunately, the 7-zip file format is designed to place its compression metadata at the end of the file. Nobody can know the chunk positions, i.e. where an LZMA block starts and ends, before downloading all of the file data.
-
@miurahr I think we are talking about different problems here. My goal was to extract the files from an already downloaded 7z archive in a streaming manner, since I still had enough space to store the compressed archive, but not the uncompressed one. Your description sounds like decompressing the data while it is being downloaded. That is a harder problem, but I think it would still be possible in some cases, since many HTTP servers support range requests. This way, you can specify that you only want to download a certain range of bytes; just the end, for example. I implemented something like that for Zip files, which also store their offsets at the end of the file.
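A minimal sketch of such a range request using the `requests` library (the URL and byte count are placeholders, and the server has to honour range requests for this to work):

```python
import requests

url = "https://example.com/archive.zip"  # placeholder URL

# Ask the server for only the last 64 KiB of the file, where zip
# (and 7z) archives keep their metadata. Servers that honour range
# requests reply with status 206 Partial Content.
response = requests.get(url, headers={"Range": "bytes=-65536"})
if response.status_code == 206:
    tail = response.content  # last 65536 bytes of the remote file
    print(f"fetched {len(tail)} bytes of metadata")
else:
    print("server ignored the Range header and sent the full file")
```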
-
Hey guys 🤓
This is my solution for the `.zip` format. Want to do the same for `7z`. I just need a way to get a fileobj.
Why? To avoid RAM or storage overheads and unpack straight into S3.
Any possibility to do so?
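Not an official py7zr API, but a rough sketch of how the `ReadOnlyFile` wrapper from the earlier comment could be pointed straight at S3 with boto3's `upload_fileobj` (the bucket name is a placeholder, and this assumes `ReadOnlyFile` is defined exactly as in that comment):

```python
import boto3
import py7zr

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder bucket name

with py7zr.SevenZipFile("test.7z", "r") as z:
    for f in z.files:
        if f.is_directory:
            continue
        # ReadOnlyFile only implements read(), which is all that
        # upload_fileobj needs; nothing is written to local disk.
        src = ReadOnlyFile(f, z)
        s3.upload_fileobj(src, bucket, f.filename)
```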