C Implementation of a lossless file compression suite, using the Lempel-Ziv-Welch (LZW) algorithm,
with on-par or better compression performance than the native UNIX compress
utility.
Benchmarks (See Testing) suggest that for intermediate code sizes (MAXBITS
= 11 to 13),
this implementation achieves about 7-12% higher compression ratios at the expense of a significant increase in compression time.
For smaller or much larger code sizes, the compression ratios are roughly on par as compress
,
but still have a significant compute time penalty.
To guard against this time penalty, string table pruning only occurs for MAXBITS > 10
, which reduces computation time in half.
Based on this testing, the recommended (and default) setting for MAXBITS
is 12.
The LZW algorithm is a universal lossless compression algorithm that with very high throughput.
It is used by the UNIX utility compress
and uncompress
as well as in GIFs.
The algorithm works by constructing a string table of (code, prefix, and character) triples,
which allows a long substring to be represented using a smaller code.
Because the string table is conducted on the fly in both compression and decompression,
the compressed file doesn't need to contain these mappings, allowing it to achieve high compression ratios.
From the root of the repository, run make
with gcc
installed to compile the source code into the executable binaries.
./compress [-m MAXBITS] < input > output\n
./decompress < input > output
MAXBITS
is the largest number of bits a code can be represented with when compressing (defaults to 12).
The string table can therefore never be larger than MAXBITS
flag isn't passed as the compressed file stores its value.
When the DBG
environment variable is set to 1, compress and decompress will dump human readable
versions of the final string tables to DBG.compress
and DBG.decompress
, respectively.
./compress < "foo.txt" > "compressed_foo.txt"
./decompress < "compressed_foo.txt" > "foo_copy.txt"
The repository includes the testing script as well as the files used for testing the compressor.
Note that the testing script requires the ncompress
utility to be installed,
which may or not be natively installed.
To run the test yourself, execute the following command from the root of the repository:
./tests/test_script.sh [[-m MAXBITS] [-d] [-b]]
The -d
flag sets DBG=1
and the -b
flag stops further tests after the first error.
The testing files come courtesy of the testing data for Snappy, a compressor/decompressor from Google. The files come from a variety of sources including the Canterbury Corpus.