Implementing a handful of sketching data structures for the fun of it. :) See slides for my talk on sketching data structures for context!
-
Bloom filters are a probabilistic data structure that attempts to answer membership queries -- ie, is the element in the set. A Bloom filter may return false positives, but never returns a false negative.
-
HyperLogLogs count the number of unique elements seen so far in a stream (ie, its cardinality). It uses log(log(cardinality)) space!!
- HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
- HyperLogLog++ paper: http://research.google.com/pubs/pub40671.html
-
Count Min Sketches produces an approximate frequency table of items in a stream. It can overestimate frequency, but will never underestimate. It's very similar to a Bloom Filter in implementation.
-
T-digests are an improvement on Q-digests, which estimate quantiles of a stream with minimal error at the extremes.