Goal:
klib
is a C macro-based fast data structure library including a hash map called
khash. This is heavily used in bioinformatic algorithms like the short-read
aligner BWA-MEM. This project uses the
benchmark
used by the klib author (Heng Li) to compare kash to the Rust
std::collections::HashMap
.
git clone https://github.com/gaberudy/hash_test.git
cd hash_test
cargo build --release
./target/release/hash_test --help
./target/release/hash_test
# items | type | alg | time | mem
-------------------------------------------
5000000 | int | rust | 0.847 | 24.6M
5000000 | int | khash | 0.262 | 32.7M
5000000 | str | rust | 1.298 | 47.4M
5000000 | str | khash | 0.951 | 32.8M
Rust is looking fast, but not matching Heng Li’s hand-crafted C code for this specific benchmark. I’m thinking the klib/khash may have a couple advantages:
-
Hash function speed. Although I tried various hash functions, and I don’t believe they are the bottleneck, I didn’t yet implement the khash function in rust (or in C and link it in). The current benchmarks are run using
FnvHashMap
from thefnv
crate, which uses the HashMap data structure and its ability to specify a different hash function (the default is cryptographically secure and fnv is also used heavily in rustc). I also tested every hash functin in the fasthash create, and none did better thanfnv
for this bechmark. -
Data types used for keys. Rust use u64 as the output of the hash functions and thus the data type for the HashMap keys. Khash is using C “int” (32-bit). This I believe accounts for most of the mem and potentially performance difference. It would be interesting to build as HashMap32 that truncates all 8-byte keys and stores them as 4-byte u32 to compare to kash.
-
In particular khash does not hash int as key types, uses them directly as keys to the hash table! I think this explains the performance compared to hashing ints in rust, but note rust gets better space efficiency with the hashed integers than the less distributed in the key-space input integers.