Arithmetic coding library for Clojure and Clojurescript.
shannon
artifacts are
released to clojars.
If you are using Maven, add the following repository definition to
your pom.xml
:
<repository>
<id>clojars.org</id>
<url>http://releases.clojars.org/repo</url>
</repository>
Or with Leiningen:
[com.joshuagriffith/shannon "0.1.1"]
Shannon registers default polymorphic compressors for most standard Clojure and Clojurescript types:
(use 'shannon.compressor)
(def out (compress [1 2 3 {:a :b}]))
compress
will return a ByteArray
in Clojure and a #js [...]
array of unsigned bytes in Clojurescript. To decompress:
(decompress out) ; ⇒ [1 2 3 {:a :b}]
(count out) ; ⇒ 11 (bytes)
Shannon also supports coding with several discrete distributions. For example, to create a Zipf-distributed coder for the first million integers:
(use 'shannon.coding-primitives)
(def z (zipf 1000000))
(compress 0 z) ; ⇒ compresses to 1 byte
(compress 999999 z) ; ⇒ compresses to 4 bytes
As expected, smaller numbers require less space than larger numbers, because the Zipf distribution assumes that smaller numbers are more probable. Compare the above behavior with the result of using a uniform distribution:
(def u (uniform 1000000))
(compress 0 u) ; ⇒ compresses to 3 bytes
(compress 999999 u) ; ⇒ compresses to 3 bytes
Coders are composeable. For example, to encode a sequence of up to 20
integers with the distribution previously defined by z
, where the
number of integers is uniformly distributed:
(def count-coder (uniform (inc 20)))
(def arr-coder (variable-array z count-coder))
(def o (compress (range 13) arr-coder)) ; ⇒ compresses to 11 bytes
(decompress o arr-coder) ; ⇒ (0 1 2 3 4 5 6 7 8 9 10 11 12)
Here's a coder for a tuple consisting of an english string, the
variable sequence of integers coded by arr-coder
above, and a date:
(use 'shannon.base-coders)
(def tuple-coder (fixed-array [english-coder arr-coder date-coder]))
(def t ["Hello world!" (range 10) #inst "2014-01-01T00:00:00.000Z"])
(def o (compress t tuple-coder))
; ⇒ compresses to 24 bytes
(decompress o tuple-coder)
; ⇒ ("Hello world!" (0 1 2 3 4 5 6 7 8 9) #inst "2014-01-01T00:00:00.000-00:00")
Compare this with the default coder, which has to store type information and uses integers with Long ranges:
(compress t) ; ⇒ compresses to 34 bytes
Currently, this library is more useful as a set of cross-platform, composeable encoding primitives rather than as a high-performance, universal compression library. Practical compression and runtime performance requires the addition of adaptive distribution primitives with caching rather than using generic, static distributions for each datatype. In addition, the following would be useful:
- Tagged language strings
- Tagged custom types
- Clojure/Clojurescript integration tests
- Improved documentation
- mathematicalmonk has an information theory primer, which includes a detailed walk-through of writing a practical arithmetic coder
- A New Approximation Formula for Computing the N-th Harmonic Number
- Coefficients for the Lanczos Approximation to the Gamma Function
- Information Theory, Inference and Learning Algorithms by David J. C. MacKay. Chapter 6 has an overview of stream coding.
- Numerical Recipes: The Art of Scientific Computing
- Probability Theory: The Logic of Science by E. T. Jaynes
- MIT 6.050J: Information and Entropy
- MIT 6.450: Principles of Digital Communications I
Copyright © 2013–2014 Joshua B. Griffith.
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.