Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
git clone --recurse-submodules https://github.com/refresh-bio/kmer-db
cd kmer-db && gmake
INPUT=./test/virus
OUTPUT=./output
mkdir $OUTPUT
# build a database from all 18-mers (default) contained in a set of sequences
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db
# establish numbers of common k-mers between new sequences and the database
./bin/kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv
# calculate Jaccard index from common k-mers
./bin/kmer-db distance jaccard $OUTPUT/n2a.csv $OUTPUT/n2a.jaccard
# extend the database with new sequences
./bin/kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db
# establish numbers of common k-mers between all sequences in the database
./bin/kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv
# build a database from 10% of 25-mers using 16 threads
./bin/kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db
# establish number of common 25-mers between a single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done prior to the comparison)
./bin/kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv
# build two partial databases
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.parts1.db
./bin/kmer-db build $INPUT/seqs.part2.list $OUTPUT/k18.parts2.db
# establish numbers of common k-mers between all sequences in the databases,
# computations are done in the sparse mode, the output matrix is also sparse
echo $OUTPUT/k18.parts1.db > $OUTPUT/db.list
echo $OUTPUT/k18.parts2.db >> $OUTPUT/db.list
./bin/kmer-db all2all-parts $OUTPUT/db.list $OUTPUT/k18.parts.csv
Kmer-db comes with a set of precompiled binaries for Linux, macOS, and Windows. The software is also available on Bioconda:
conda install -c bioconda kmer-db
For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. Kmer-db can also be built from the sources distributed as:
- GNU Make project for Linux and macOS (gmake 4.3 and gcc/g++ 11 or newer required),
- Visual Studio 2022 solution for Windows.
Kmer-db can be built for x86-64 and ARM64 8 architectures (including Apple Mx based on ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This, however, can be changed by setting the PLATFORM variable for make:
make PLATFORM=none # unspecified platform, no extensions
make PLATFORM=sse2 # x86-64 with SSE2
make PLATFORM=avx # x86-64 with AVX
make PLATFORM=avx2 # x86-64 with AVX2 (default)
make PLATFORM=native # x86-64 with AVX2 and native architecture
make PLATFORM=arm8 # ARM64 8 with NEON
make PLATFORM=m1 # ARM64 8.4 (especially Apple M1) with NEON
Note that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on an SSE-only platform, but with limited performance.
kmer-db <mode> [options] <positional arguments>
Kmer-db operates in one of the following modes:
- build - building a database from samples,
- all2all - counting common k-mers - all samples in the database,
- all2all-sp - counting common k-mers - all samples in the database (sparse computation),
- all2all-parts - counting common k-mers - all samples in the database parts (sparse computation),
- new2all - counting common k-mers - set of new samples versus database,
- one2all - counting common k-mers - single sample versus database,
- distance - calculating similarities/distances,
- minhash - storing minhashed k-mers.
Common options:
- -t <threads> - number of threads (default: number of available cores).
The meaning of other options and positional arguments depends on the selected mode.
Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:
- compressed or uncompressed genomes/reads:
  kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-t <threads>] <sample_list> <database>
- KMC-generated k-mers:
  kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <sample_list> <database>
- minhashed k-mers produced by minhash mode:
  kmer-db build -from-minhash [-extend] [-t <threads>] <sample_list> <database>
Parameters:
- sample_list (input) - file containing list of samples in the following format:
  sample_file_1
  sample_file_2
  sample_file_3
  ...
  By default, the tool requires uncompressed or compressed FASTA files for each sample. If a file on the list cannot be found, the package tries adding the following extensions: fna, fasta, gz, fna.gz, fasta.gz. When the -from-kmers switch is specified, corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
- database (output) - file with generated k-mer database,
- -k <kmer-length> - length of k-mers (default: 18); ignored when the -from-kmers or -from-minhash switch is specified,
- -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when the -from-minhash switch is present,
- -multisample-fasta - each sequence in a FASTA file is treated as a separate sample,
- -extend - extend the existing database with new samples,
- -t <threads> - number of threads (default: number of available cores).
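The extension fallback described for sample_list can be checked before a long build run. The following is a minimal Python sketch of that lookup, not the tool's own code: the helper name resolve_sample, the base_dir argument, and the assumption that extensions are tried in the listed order are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# Extensions named in the sample_list description, in the order they are listed.
# Trying them in this order is an assumption of this sketch.
CANDIDATE_EXTENSIONS = ["fna", "fasta", "gz", "fna.gz", "fasta.gz"]


def resolve_sample(entry: str, base_dir: Path = Path(".")) -> Optional[Path]:
    """Return the first existing file for a sample-list entry, or None."""
    exact = base_dir / entry
    if exact.is_file():
        return exact
    for ext in CANDIDATE_EXTENSIONS:
        candidate = base_dir / f"{entry}.{ext}"
        if candidate.is_file():
            return candidate
    return None


def missing_samples(list_file: Path) -> list:
    """List entries of a sample list that cannot be resolved to any file."""
    entries = [line.strip() for line in list_file.read_text().splitlines() if line.strip()]
    return [e for e in entries if resolve_sample(e, list_file.parent) is None]
```

Running missing_samples on a list file before kmer-db build gives an early report of entries that would not be found, under the assumption that paths are relative to the directory of the list file.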
Dense computations - recommended when the distance matrix contains few zeros. Output can be stored in the dense or sparse form (-sparse switch).
kmer-db all2all [-buffer <size_mb>] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <common_table>
Sparse computations - recommended when the distance matrix contains many zeros. Output matrix is always in the sparse form:
kmer-db all2all-sp [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <database> <common_table>
Sparse computations, partial databases - use when the distance matrix contains many zeros and there are multiple partial databases. Output matrix is always in the sparse form:
kmer-db all2all-parts [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <db_list> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode,
- db_list (input) - file containing list of database files created by build mode,
- common_table (output) - file containing table with common k-mer counts,
- -buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance; default: 8,
- -t <threads> - number of threads (default: number of available cores),
- -sparse - stores output matrix in a sparse form (always on in all2all-sp and all2all-parts modes),
- -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
- -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below),
- -sample-rows [<criterion>:]<count> - retains count elements in every row using one of the strategies: (i) random selection (no criterion); (ii) the best elements with respect to criterion.
criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers (filtering) or random elements selection (sampling). Multiple filters can be combined.
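The filtering rules can be illustrated with a small Python sketch. It is not taken from the Kmer-db sources: the function name passes and the data layout are hypothetical, and it assumes that combined filters must all hold for an element to be retained.

```python
from typing import Dict, List, Tuple

# A filter is (kind, criterion, value), e.g. ("min", "num-kmers", 10) for
# "-min num-kmers:10". Treating combined filters as a conjunction is an
# assumption of this sketch.
Filter = Tuple[str, str, float]


def passes(element: Dict[str, float], filters: List[Filter]) -> bool:
    """Return True if the element satisfies every -min/-max filter."""
    for kind, criterion, value in filters:
        x = element[criterion]
        if kind == "min" and x < value:   # -min keeps x >= value
            return False
        if kind == "max" and x > value:   # -max keeps x <= value
            return False
    return True


# Illustrative element with made-up numbers.
element = {"num-kmers": 12, "jaccard": 0.4}
filters = [("min", "num-kmers", 10), ("max", "jaccard", 0.5)]
print(passes(element, filters))
```

Both bounds are inclusive, matching the "greater than or equal to" and "lower than or equal to" wording of the -min and -max descriptions.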
kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <sample_list> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode,
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database,
- common_table (output) - file containing table with common k-mer counts,
- -multisample-fasta / -from-kmers / -from-minhash - see build mode for details,
- -t <threads> - number of threads (default: number of available cores),
- -sparse - stores output matrix in a sparse form,
- -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
- -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below).
criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers. Multiple filters can be combined.
kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>
The meaning of the parameters is the same as in new2all mode, but instead of specifying a file with a sample list, a single sample file is used as a query.
Modes all2all, all2all-sp, all2all-parts, new2all, and one2all produce a comma-separated table with numbers of common k-mers. For all2all, new2all, and one2all modes, the table is by default stored in a dense form:
kmer-length: k fraction: f | db-samples | s1 | s2 | ... | sn |
query-samples | total-kmers | |s1| | |s2| | ... | |sn| |
q1 | |q1| | |q1 ∩ s1| | |q1 ∩ s2| | ... | |q1 ∩ sn| |
q2 | |q2| | |q2 ∩ s1| | |q2 ∩ s2| | ... | |q2 ∩ sn| |
... | ... | ... | ... | ... | ... |
qm | |qm| | |qm ∩ s1| | |qm ∩ s2| | ... | |qm ∩ sn| |
where:
- k - k-mer length,
- f - minhash fraction (1, when minhashing is disabled),
- s1, s2, ..., sn - database sample names,
- q1, q2, ..., qm - query sample names,
- |a| - number of k-mers in sample a,
- |a ∩ b| - number of k-mers common for samples a and b.
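A dense table can be post-processed with a few lines of Python. The sketch below assumes the on-disk file follows the layout shown above literally (comma-separated cells, a possible trailing comma, full rows) and uses made-up sample names and counts; read_dense_table and jaccard are hypothetical helper names.

```python
import csv
import io

# Made-up example following the dense layout above: two database samples
# (s1, s2) and two query samples (q1, q2).
EXAMPLE = """\
kmer-length: 18 fraction: 1,db-samples,s1,s2,
query-samples,total-kmers,100,200,
q1,150,50,0,
q2,80,10,40,
"""


def read_dense_table(text):
    """Parse a dense common-k-mer table into names, totals, and query rows."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in raw]
        while cells and cells[-1] == "":   # drop cells left by trailing commas
            cells.pop()
        if cells:
            rows.append(cells)
    db_names = rows[0][2:]
    db_totals = [int(x) for x in rows[1][2:]]
    queries = {r[0]: (int(r[1]), [int(x) for x in r[2:]]) for r in rows[2:]}
    return db_names, db_totals, queries


def jaccard(common, q_total, s_total):
    """|q ∩ s| / |q ∪ s| computed from the counts stored in the table."""
    return common / (q_total + s_total - common)


db_names, db_totals, queries = read_dense_table(EXAMPLE)
q_total, counts = queries["q1"]
print(db_names[0], jaccard(counts[0], q_total, db_totals[0]))
```

For real data, the distance mode performs this calculation directly; the sketch only shows how the stored counts relate to each other.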
When the -sparse switch is specified or all2all-sp, all2all-parts modes are used, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (column_id: value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:
kmer-length: k fraction: f | db-samples | s1 | s2 | ... | sn |
query-samples | total-kmers | |s1| | |s2| | ... | |sn| |
q1 | |q1| | i11: |q1 ∩ si11| | i12: |q1 ∩ si12| |
q2 | |q2| | i21: |q2 ∩ si21| | i22: |q2 ∩ si22| | i23: |q2 ∩ si23| |
q3 | |q3| |
... | ... | ... |
qm | |qm| | im1: |qm ∩ sim1| |
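A sparse row can be expanded back into a dense vector as sketched below in Python. The helper name parse_sparse_row is hypothetical, and the exact separators in real output files (commas between cells, a colon inside each pair) are assumed from the description above.

```python
def parse_sparse_row(line: str, n_db: int):
    """Expand one sparse row into (name, total k-mers, dense list of counts).

    Each non-empty cell after the first two is a "column_id: value" pair,
    where column_id is 1-based; omitted columns are zeros.
    """
    cells = [c.strip() for c in line.split(",")]
    name, total = cells[0], int(cells[1])
    dense = [0] * n_db
    for cell in cells[2:]:
        if not cell:                      # empty cell, e.g. after a trailing comma
            continue
        column_id, value = cell.split(":")
        dense[int(column_id) - 1] = int(value)   # convert 1-based to 0-based
    return name, total, dense


# Made-up row: q1 shares 50 k-mers with column 1 and 7 k-mers with column 3.
print(parse_sparse_row("q1,150,1: 50,3: 7,", n_db=4))
```

A row without any pairs, such as a sample sharing no k-mers with the database, expands to a vector of zeros.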
For performance reasons, all2all, all2all-sp, and all2all-parts modes produce a lower triangular matrix.
kmer-db distance <measure> [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <common_table> <output_table>
Parameters:
- measure - name of the similarity/distance measure to be calculated, can be one of the following:
  - jaccard: $J(q,s) = |q \cap s| / |q \cup s|$,
  - min: $\min(q,s) = |q \cap s| / \min(|q|,|s|)$,
  - max: $\max(q,s) = |q \cap s| / \max(|q|,|s|)$,
  - cosine: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
  - mash (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k}\ln\frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
  - ani (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(q,s)$,
  - ani-shorter - same as ani but with min used instead of jaccard.
- common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode (both dense and sparse matrices are supported),
- output_table (output) - file containing table with calculated distance measure,
- -phylip-out - store output distance matrix in a Phylip format,
- -sparse - outputs a sparse matrix (only for dense input matrices - sparse inputs always produce sparse outputs),
- -min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
- -max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below).
criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). If no criterion is specified, the measure argument is used by default. Multiple filters can be combined.
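The measure definitions can be written out in Python as a worked example. This is a sketch following the formulas in the parameter list, not the tool's implementation; the function name measures is hypothetical, and rejecting pairs without common k-mers is a simplification made here because the logarithm is undefined at zero.

```python
import math


def measures(common: int, q_total: int, s_total: int, k: int) -> dict:
    """Compute all listed measures from |q ∩ s|, |q|, |s| and k-mer length k."""
    if common <= 0:
        # Simplification of this sketch: mash/ani are undefined for J = 0.
        raise ValueError("at least one common k-mer is required")
    union = q_total + s_total - common            # |q ∪ s|
    jaccard = common / union
    min_sim = common / min(q_total, s_total)
    max_sim = common / max(q_total, s_total)
    cosine = common / math.sqrt(q_total * s_total)

    def mash_from(similarity: float) -> float:
        # -(1/k) * ln(2x / (1 + x)), with x being jaccard or min
        return -math.log(2 * similarity / (1 + similarity)) / k

    mash = mash_from(jaccard)
    return {
        "jaccard": jaccard,
        "min": min_sim,
        "max": max_sim,
        "cosine": cosine,
        "mash": mash,
        "ani": 1 - mash,
        "ani-shorter": 1 - mash_from(min_sim),
    }


# Made-up counts: 50 common 18-mers, 150 in the query, 100 in the database sample.
for name, value in measures(50, 150, 100, k=18).items():
    print(f"{name}: {value:.4f}")
```

For two identical samples the Jaccard index is 1, so the Mash distance is 0 and ani equals 1.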
This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with the -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:
kmer-db minhash [-f <fraction>] [-k <kmer-length>] [-multisample-fasta] <sample_list>
kmer-db minhash -from-kmers [-f <fraction>] <sample_list>
Parameters:
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode),
- -f <fraction> - fraction of all k-mers to be accepted by the minhash filter (default: 0.01),
- -k <kmer-length> - length of k-mers (default: 18; maximum: 30); ignored when the -from-kmers switch is specified,
- -multisample-fasta / -from-kmers - see build mode for details.
For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
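The idea of a minhash filter that accepts a fraction of k-mers can be sketched in Python as follows. This only illustrates the general technique: the hash function (BLAKE2b), the threshold rule, and the helper names are assumptions of this sketch, and the actual Kmer-db hash and the .minhash file format are not reproduced. Canonical (strand-independent) k-mers are also left out for brevity.

```python
import hashlib


def kmers(sequence: str, k: int) -> set:
    """Return the set of all k-mers of a sequence (no canonical form here)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def minhash_filter(kmer_set: set, fraction: float) -> set:
    """Keep k-mers whose 64-bit hash falls below fraction * 2^64.

    The decision depends only on the k-mer itself, so the same k-mer is
    kept or dropped consistently in every sample.
    """
    threshold = int(fraction * 2**64)
    kept = set()
    for kmer in kmer_set:
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") < threshold:
            kept.add(kmer)
    return kept


# Made-up sequences sharing a common prefix.
a = kmers("ACGTACGTTGCAACGTGGATCCATGCA", 5)
b = kmers("ACGTACGTTGCAACGTCCTAGGTACGA", 5)
fa, fb = minhash_filter(a, 0.5), minhash_filter(b, 0.5)
print(len(a & b), len(fa & fb))
```

Because the filter is deterministic per k-mer, filtering two samples and intersecting the results gives the same set as filtering their intersection, which is why common k-mer counts on minhashed samples remain comparable.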
The list of pathogens investigated in the Kmer-db study can be found here.