Skip to content

A raw, simple and effective golang implementation of a Bloom Filter without any dependencies

License

Notifications You must be signed in to change notification settings

toniphan21/go-bf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Go Bloom Filter

codecov Codacy Badge Go Report Card Go Reference

A Bloom Filter written in GoLang with flexible options and no dependencies.

Usage

Initialize a Bloom Filter WithAccuracy configuration

package main

import "github.com/toniphan21/go-bf"

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	filter := bf.Must(bf.WithAccuracy(errorRate, numberOfItems))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true it MAYBE correct. Remember to check false positive cases.")
	}
}

Initialize a Bloom Filter WithCapacity configuration

package main

import "github.com/toniphan21/go-bf"

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	filter := bf.Must(bf.WithCapacity(capacityInBits, numberOfHashFunctions))

	filter.Add([]byte("anything"))

	if !filter.Exists([]byte("not found")) {
		println("If a bloom filter returns false, it is 100% correct.")
	}

	if filter.Exists([]byte("anything")) {
		println("If a bloom filter returns true it MAYBE correct. Remember to check false positive cases.")
	}
}

APIs

Constructors

  • New(Config, ...Options) (BloomFilter, error) initialize new instance
  • Must(Config, ...Options) BloomFilter initialize new instance, panic if encounter any error

BloomFilter interface

The BloomFilter interface has 8 main methods:

Method Description
Add([]byte) Add an item into the filter
Exists([]byte) bool Check existence of an item in the filter
Count() int Get number of items added into the filter. Return -1 if not sure (for example after using Intersect() or Union()
Clone() (BloomFilter, error) Create new BloomFilter instance with the same storage, hasher and data
Intersect(BloomFilter) error Intersect with given filter. They must use the same Storage and Hash. Only Storage's data of current filter is affected, given filter's data is not
Union(BloomFilter) error Union with given filter. They must use the same Storage and Hash. Only Storage's data of current filter is affected, given filter's data is not
Storage() Storage Get filter's Storage
Hasher() Hasher Get filter's Hash

Options

There are 4 option functions could be used from the second param of bf.New(Config, ...OptionFunc):

Signature Description
WithSHA() default Use splitted SHA hashing strategy (more uniform hash)
WithFNV() Use splitted FNV hashing strategy (better performance)
WithHasher(f HasherFactory) Customize Hashing strategy with a HasherFactory
WithStorage(f StorageFactory) Customize Storage strategy with a StorageFactory

Implementation Details

Error rate, number of hash functions calculation

This is just a summary of how to calculate estimated error rate of a Bloom Filter, for proof please check academic paper or wikipedia page.

Given:

  • n estimated number of items in a Bloom Filter
  • e false positive error rate
  • k number of hash functions needed
  • m bits of memory

If you have n and e (WithAccuracy config):

$$\huge k \approx -log_2 e$$ $$\huge m \approx -1.44 n log_2 e$$

WithAccuracy has a builtin .Info() returns calculated m, k and rounded values.

package main

import (
	"fmt"
	"github.com/toniphan21/go-bf"
)

func main() {
	var errorRate = 0.001
	var numberOfItems uint32 = 10_000_000
	config := bf.WithAccuracy(errorRate, numberOfItems)
	fmt.Println(config.Info())
}

/** OUTPUT:

Config WithAccuracy()
  - Requested error rate: 0.10000%
  - Expected number of items: 10000000
  - Bits per item: 14.351
  - Number of hash functions: 10
  - Size in bits of each has function: 28
  - Storage capacity: 143507294 bits = 17938412 bytes = 17517.98KB = 17.11MB
  - Estimated error rate: 0.10130%

*/

If you have m and k (WithCapacity config), estimated error rate when n items are added into a Bloom Filter:

$$\huge \epsilon \approx (1 - e^-\frac{k n}{m} )^k$$

WithCapacity configuration has a builtin .Info() to display the estimated error rate when config by m and k.

package main

import (
	"fmt"
	"github.com/toniphan21/go-bf"
)

func main() {
	var capacityInBits uint32 = 65_536
	var numberOfHashFunctions byte = 5
	config := bf.WithCapacity(capacityInBits, numberOfHashFunctions)
	fmt.Println(config.Info())
}

/** OUTPUT:

Config WithCapacity()
  - Storage capacity: 65536 bits = 8192 bytes = 8.00KB = 0.01MB
  - Number of hash functions: 5
  - Size in bits of each has function: 16
  - Estimated error rate by n - number of added items:
      n=   100; estimated error rate: 0.00000%
      n=   200; estimated error rate: 0.00000%
      n=   500; estimated error rate: 0.00001%
      n=  1000; estimated error rate: 0.00021%
      n=  2000; estimated error rate: 0.00568%
      n=  5000; estimated error rate: 0.32083%
      n= 10000; estimated error rate: 4.33023%
      n= 20000; estimated error rate: 29.35056%
      n= 50000; estimated error rate: 89.45317%
      n=100000; estimated error rate: 99.75726%
      n=200000; estimated error rate: 99.99988%
      n=500000; estimated error rate: 100.00000%

 */

If you spot something wrong with the calculation, no worries - you can write your own config.

Hashing strategy

This library has builtin 2 hashers with the same strategy:

  • From the config we could know: keySize minimum key size (in bits) and keyCount number of hash function needed.
  • When BloomFilter calls Hasher.Hash() it passes count - number of keys array needed
  • Use SHA-256 or FNV-128 to generate hash bytes from the input. If count * keySize * keyCount > 256 when use SHA (or 128 when use FNV), will do the hash multiple times with a byte prefixed
  • Pick count * keyCount * keySize bits from the hash bytes in previous step, discard all remaining bits.

Example 1: count = 1, keySize = 25, keyCount = 10, use SHA-256:

  • Because 1*25*10 = 250 bits, we only need to hash 1 time
  • hash(256 bits) = sha_hash(input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • bit 250-255 is discarded
  • return [1][10]uint32{ {key0, key1...key9} }

Example 2: count = 1, keySize = 25, keyCount = 10, use FNV-128:

  • Because 1*25*10 > 128, we hash input 2 times
  • hash(256 bits) = fnv_128(input) + fnv_128(byte(0) + input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • bit 250-255 is discarded
  • return [1][10]uint32{ {key0, key1...key9} }

Example 3: count = 2, keySize = 25, keyCount = 10, use FNV-128:

  • Because 2*25*10 = 500 > 128 bits, we need to hash 4 times
  • hash(512 bits) = fnv_128(input) + fnv_128(byte(0) + input) + fnv_128(byte(1) + input)+ fnv_128(byte(2) + input)
  • pick key 0 = bit 0-24
  • pick key 1 = bit 25-49
  • ...
  • pick key 9 = bit 225-249
  • ...
  • pick key 19 = bit 475-499
  • bit 500-512 is discarded
  • return [2][10]uint32{ {key0, key1...key9}, {key10...key19} }

This strategy guarantees that the first n keys will always be the same independent from count passed via Hasher.Hash() function. A classic BloomFilter always use count=1 but the other variants may use more than 1 key collection.

Customization

Write your own hashing strategy

By default, an SHAHasher is used. You can customize the Hasher functions by implement Hasher and HasherFactory interface:

package main

import "github.com/toniphan21/go-bf"

type YourHasher struct {
  keyCount byte
  keySize  byte
}

func (y *YourHasher) Equals(other bf.Hasher) bool {
  o, ok := other.(*YourHasher)
  if !ok {
    return false
  }
  // check other params
  return y.keyCount == o.keyCount && y.keySize == o.keySize
}

func (y *YourHasher) Hash(input []byte, count int) [][]bf.Key {
  // return an array of a hash array for given bytes input.
  //   - length of the array is count
  //   - each of subarray will have length = keyCount - number of hash functions
  //   - each hash need to >= keySize - minimum size of a hash in bits
  // For example: given keyCount = 5, keySize = 16
  //   - count = 1 requires you to returns:
  //     [][]Key{
  //       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
  //     }
  //   - count = 2 requires you to returns:
  //     [][]Key{
  //       { "key0: at least 16 bits long", "key1:...", "key2:...", "key3:...", "key4:..."},
  //       { "key5: at least 16 bits long", "key6:...", "key7:...", "key7:...", "key8:..."},
  //     }
  return [][]bf.Key{}
}

type YourHasherFactory struct{}

func (y *YourHasherFactory) Make(numberOfHashFunctions, hashSizeInBits byte) bf.Hasher {
  return &YourHasher{
    keyCount: numberOfHashFunctions,
    keySize:  hashSizeInBits,
  }
}

func main() {
  config := bf.WithAccuracy(0.01, 1_000_000)
  filter := bf.Must(config, bf.WithHasher(&YourHasherFactory{}))

  filter.Add([]byte("anything"))
  // ...
}

Write your own custom storage

By default, all data are stored in memory, you can customize a storage by implement Storage and StorageFactory interface:

package main

import "github.com/toniphan21/go-bf"

type FileStorage struct {
  capacity uint32
}

func (f *FileStorage) Intersect(other bf.Storage) {
  // implement BatchIntersect to perform Intersect operator faster
}

func (f *FileStorage) Union(other bf.Storage) {
  // implement BatchUnion to perform Union operator faster
}

func (f *FileStorage) Equals(other bf.Storage) bool {
  o, ok := other.(*FileStorage)
  if !ok {
    return false
  }
  // check other params
  return o.capacity == f.capacity
}

func (f *FileStorage) Set(index uint32) {
  // set a bit in the given index to true
}

func (f *FileStorage) Clear(index uint32) {
  // clear a bit in the given index to true
}

func (f *FileStorage) Get(index uint32) bool {
  // return a boolean in the given index
  return false
}

func (f *FileStorage) Capacity() uint32 {
  // return the capacity of the storage in bits
  return f.capacity
}

type FileStorageFactory struct{}

func (f *FileStorageFactory) Make(capacity uint32) (bf.Storage, error) {
  return &FileStorage{capacity}, nil
}

func main() {
  config := bf.WithAccuracy(0.01, 1_000_000)
  filter := bf.Must(config, bf.WithStorage(&FileStorageFactory{}))

  filter.Add([]byte("anything"))
  // ...
}

Write your own config

If you don't like WithCapacity() or WithAccuracy() configuration, you can write your own:

package main

import "github.com/toniphan21/go-bf"

type YourConfig struct {
}

func (y *YourConfig) Info() string {
  return "info about your config"
}

func (y *YourConfig) NumberOfHashFunctions() byte {
  return 5
}

func (y *YourConfig) StorageCapacity() uint32 {
  return 1_000_000
}

func (y *YourConfig) KeySize() byte {
  return 32
}

func main() {
  config := &YourConfig{}
  filter, err := bf.New(config)
  if err != nil {
    panic("Something went wrong")
  }

  filter.Add([]byte("anything"))
  // ...
}

Benchmark

BenchmarkBloomFilter_WithSHA_Add
BenchmarkBloomFilter_WithSHA_Add-12       	 1027905	      1152 ns/op
BenchmarkBloomFilter_WithFNV_Add
BenchmarkBloomFilter_WithFNV_Add-12       	 2113863	       573.3 ns/op
BenchmarkBloomFilter_WithSHA_Exists
BenchmarkBloomFilter_WithSHA_Exists-12    	 1000000	      1132 ns/op
BenchmarkBloomFilter_WithFNV_Exists
BenchmarkBloomFilter_WithFNV_Exists-12    	 2175315	       549.9 ns/op
BenchmarkBloomFilter_Intersect
BenchmarkBloomFilter_Intersect/bench
BenchmarkBloomFilter_Intersect/bench-12   	   10000	    106422 ns/op
BenchmarkBloomFilter_Union
BenchmarkBloomFilter_Union/bench
BenchmarkBloomFilter_Union/bench-12       	   10000	    106651 ns/op
BenchmarkBloomFilter_Clone
BenchmarkBloomFilter_Clone/bench
BenchmarkBloomFilter_Clone/bench-12       	    6856	    168797 ns/op

About

A raw, simple and effective golang implementation of a Bloom Filter without any dependencies

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages