Quick User's Guide
Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a plain memcpy() call. Blosc works well for compressing numerical arrays that contain data with relatively low entropy, like sparse data, time series, grids with regularly spaced values, etc.
python-blosc is a Python package that wraps it.
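As a quick preview, a round trip through the two core calls looks like this (a minimal sketch; the typesize parameter tells Blosc the size in bytes of each item):

>>> import blosc
>>> data = b'\x01\x02\x03\x04' * 1000           # low-entropy binary data
>>> packed = blosc.compress(data, typesize=4)   # typesize: bytes per item
>>> blosc.decompress(packed) == data
True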
You can install it from the PyPI repository with the pip command-line utility:
$ pip install blosc
or, if you prefer compiling the sources yourself, read on.
Assuming that you have a C compiler installed, do:
$ python setup.py build_ext --inplace
This package supports Python 2.6, 2.7, and 3.3 or higher.
After compiling, you can quickly check that the package is sane by running:
$ export PYTHONPATH=.    (use "set PYTHONPATH=." on Windows)
$ python blosc/toplevel.py    (add -v for verbose mode)
Install it as a typical Python package:
$ python setup.py install
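Once installed, a quick smoke test is to import the package and print its version (a sketch; this assumes the conventional __version__ attribute):

$ python -c "import blosc; print(blosc.__version__)"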
[Figures below were obtained on a VM with only 2 cores, running on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz]
# Let's create a NumPy array with 80 MB full of data
>>> import numpy as np
>>> a = np.linspace(0, 100, int(1e7))   # the number of samples must be an integer
>>> bytes_array = a.tobytes()   # get a bytes stream (tostring() is the old alias)
# Blosc as a very fast compressor
>>> import zlib
>>> %time zpacked = zlib.compress(bytes_array)
CPU times: user 4.03 s, sys: 0.03 s, total: 4.06 s
Wall time: 4.08 s # ~ 20 MB/s
>>> import blosc
>>> %time bpacked = blosc.compress(bytes_array, typesize=8)
CPU times: user 0.10 s, sys: 0.00 s, total: 0.11 s
Wall time: 0.05 s # ~ 1.6 GB/s and 80x faster than zlib
>>> %time acp = a.copy() # a copy of the actual data (using memcpy() behind the scenes)
CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
Wall time: 0.04 s # ~ 2 GB/s, just 25% faster than Blosc
# ... that is optimized for compressing binary data ...
>>> len(zpacked)
52994692
>>> len(bytes_array) / float(len(zpacked))
1.5095851486409242 # zlib achieves a 1.5x compression ratio
>>> len(bpacked)
7641156
>>> len(bytes_array) / float(len(bpacked))
10.469620041784253 # blosc reaches more than 10x compression ratio
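# The speed/ratio trade-off is tunable.  A sketch (the clevel and cname
# keywords are assumptions based on recent python-blosc releases):
>>> bpacked_zlib = blosc.compress(bytes_array, typesize=8, clevel=9, cname='zlib')
>>> ratio = len(bytes_array) / float(len(bpacked_zlib))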
# Blosc is also extremely fast when decompressing
>>> %time bytes_array2 = zlib.decompress(zpacked)
CPU times: user 0.28 s, sys: 0.02 s, total: 0.30 s
Wall time: 0.31 s # ~ 260 MB/s
>>> %time bytes_array2 = blosc.decompress(bpacked)
CPU times: user 0.07 s, sys: 0.02 s, total: 0.09 s
Wall time: 0.05 s # ~ 1.6 GB/s and 6x faster than zlib
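# Blosc runs multithreaded, so throughput depends on the number of threads.
# A sketch of setting it explicitly (set_nthreads() returns the previous value):
>>> nthreads_before = blosc.set_nthreads(2)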
# You can pack and unpack NumPy arrays very easily too:
>>> packed = blosc.pack_array(a)
>>> a2 = blosc.unpack_array(packed)
>>> np.alltrue(a == a2)
True
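pack_array() also stores the array metadata, so shape and dtype survive the round trip. A minimal sketch with a 2-D array (the variable names are illustrative):

>>> m = np.arange(12, dtype=np.int64).reshape(3, 4)
>>> m2 = blosc.unpack_array(blosc.pack_array(m))
>>> m2.shape, m2.dtype
((3, 4), dtype('int64'))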
Please refer to http://python-blosc.blosc.org/. You can also have a look at the docstrings. Start with the main package:
>>> import blosc
>>> help(blosc)
and then ask for help on the individual functions referenced there.
There is an official mailing list for Blosc at:
http://groups.google.es/group/blosc
That's it! Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.