Kmer-db is a fast and memory-efficient tool for estimating evolutionary distances.
Kmer-db comes with a set of precompiled binaries for Windows and Linux. The software can also be built from the sources, which are distributed as:
- MAKE project (G++ 4.8 required) for Linux and OS X,
- Visual Studio 2015 solution for Windows.
Kmer-db uses zlib for handling gzipped inputs. Under Linux, the software is by default linked against the system-installed zlib. Due to issues with some library versions, a precompiled zlib is also present in the repository. To use it, modify the INTERNAL_ZLIB variable at the top of the makefile. Under Windows, the repository library is always used.
Kmer-db by default takes advantage of AVX (required) and AVX2 (optional) CPU extensions. The pre-built binary determines the supported instructions at runtime, thus it is multiplatform. However, one may encounter a problem when building Kmer-db on a CPU without AVX2. To prevent the use of AVX2, the program must be compiled with the NO_AVX2 symbolic constant defined. When building under Linux or OS X, there is a NO_AVX2 switch at the top of the makefile which does the job.
kmer-db <mode> [options] <positional arguments>
Kmer-db operates in one of the following modes:
- build - building a database from samples,
- all2all - counting common k-mers - all samples in the database,
- new2all - counting common k-mers - set of new samples versus database,
- one2all - counting common k-mers - single sample versus database,
- distance - calculating similarities/distances,
- minhash - storing minhashed k-mers.
Common options:
- -t <threads> - number of threads (default: number of available cores).
The meaning of other options and positional arguments depends on the selected mode.
Constructing a k-mer database is an obligatory step before any further analysis. The procedure accepts several input types:
- compressed or uncompressed genomes:
  kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] <sample_list> <database>
- KMC-generated k-mers:
  kmer-db build -from-kmers [-f <fraction>] <sample_list> <database>
- minhashed k-mers produced by minhash mode:
  kmer-db build -from-minhash <sample_list> <database>
Parameters:
- sample_list (input) - file containing list of samples in the following format:
  sample1
  sample2
  sample3
  ...
  By default, the tool requires compressed (.gz/.fna.gz/.fasta.gz) or uncompressed (.fna/.fasta) genome files for each sample (extensions are added automatically). When the -from-kmers switch is specified, the corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
- database (output) - file with generated k-mer database.
- -k <kmer-length> - length of k-mers (default: 18); ignored when the -from-kmers or -from-minhash switch is specified.
- -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when the -from-minhash switch is present.
- -multisample-fasta - each sequence in a genome FASTA file is treated as a separate sample.
kmer-db all2all [-buffer <size_mb>] <database> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode,
- common_table (output) - file containing table with common k-mer counts,
- -buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD to maximize performance (default: 8).
kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] <database> <sample_list> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode.
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database.
- common_table (output) - file containing table with common k-mer counts.
- -multisample-fasta / -from-kmers / -from-minhash - see build mode for details.
kmer-db one2all [-multisample-fasta|-from-kmers|-from-minhash] <database> <sample> <common_table>
The meaning of the parameters is the same as in new2all mode, but instead of a file with a sample list, a single sample file is used as the query.
Modes all2all, new2all, and one2all produce a comma-separated table with the numbers of common k-mers. The table is in the following form:
kmer-length: k fraction: f | db-samples | s1 | s2 | ... | sn |
query-samples | total-kmers | |s1| | |s2| | ... | |sn| |
q1 | |q1| | |q1 ∩ s1| | |q1 ∩ s2| | ... | |q1 ∩ sn| |
q2 | |q2| | |q2 ∩ s1| | |q2 ∩ s2| | ... | |q2 ∩ sn| |
... | ... | ... | ... | ... | ... |
qm | |qm| | |qm ∩ s1| | |qm ∩ s2| | ... | |qm ∩ sn| |
where:
- k - k-mer length,
- f - minhash fraction (1, when minhashing is disabled),
- s1, s2, ..., sn - database sample names,
- q1, q2, ..., qm - query sample names,
- |a| - number of k-mers in sample a,
- |a ∩ b| - number of k-mers common for samples a and b.
For performance reasons, all2all mode produces a lower triangular matrix.
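The table layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of Kmer-db: the function name parse_common_table is hypothetical, the numbers are made up, and the exact byte layout of the file (for example, whether rows end with a trailing comma) is an assumption based only on the form shown above.

```python
import csv
import io

def parse_common_table(text):
    """Parse a common k-mer table laid out as described above.

    Returns (db_names, db_totals, rows), where rows is a list of
    (query_name, query_total, counts) tuples.
    """
    lines = []
    for raw in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in raw]
        # Drop empty trailing cells (assumption: rows may end with a comma).
        while cells and cells[-1] == "":
            cells.pop()
        if cells:
            lines.append(cells)

    header, totals, body = lines[0], lines[1], lines[2:]
    db_names = header[2:]                      # cells after "db-samples"
    db_totals = [int(c) for c in totals[2:]]   # cells after "total-kmers"

    rows = []
    for cells in body:
        name, total = cells[0], int(cells[1])
        # A row may be shorter than the header (lower triangular output).
        counts = [int(c) for c in cells[2:]]
        rows.append((name, total, counts))
    return db_names, db_totals, rows


# Made-up numbers, used only to exercise the parser.
example = """kmer-length: 20 fraction: 1,db-samples,s1,s2,
query-samples,total-kmers,100,80,
q1,90,40,10,
q2,50,5,25,
"""
db_names, db_totals, rows = parse_common_table(example)
print(db_names, db_totals)
for row in rows:
    print(row)
```

Because all2all output is lower triangular, a row may carry fewer counts than there are database samples; the sketch keeps such rows as they are rather than padding them.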
kmer-db distance [<measures>] <common_table>
Parameters:
- common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode.
- measures - names of the similarity/distance measures to be calculated; can be one or several of the following: jaccard, min, max, cosine, mash. If measures are not specified, jaccard is used by default.
This mode generates a file with a similarity/distance table for each selected measure. The name of each output file is formed by appending an extension with the measure name to the input file name.
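To show how a measure follows from the counts in the common k-mer table, here is a minimal sketch of the textbook Jaccard index, |a ∩ b| / |a ∪ b|, computed from |a|, |b|, and |a ∩ b|. It assumes the standard set-based definition; the exact formulas Kmer-db applies for jaccard and the other measures are not given in this document, and the function name and numbers below are illustrative only.

```python
def jaccard(a_total, b_total, common):
    """Textbook Jaccard index |A ∩ B| / |A ∪ B| computed from set sizes."""
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = a_total + b_total - common
    # Two empty samples have no defined ratio; return 0.0 by convention here.
    return common / union if union else 0.0


# Made-up counts: |q1| = 90, |s1| = 100, |q1 ∩ s1| = 40.
print(jaccard(90, 100, 40))
```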
This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with the -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:
kmer-db minhash [-k <kmer-length>] [-multisample-fasta] <fraction> <sample_list>
kmer-db minhash -from-kmers <fraction> <sample_list>
Parameters:
- fraction (input) - fraction of all k-mers to be accepted by the minhash filter.
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode).
- -k <kmer-length> - length of k-mers (default: 18); ignored when the -from-kmers switch is specified.
- -multisample-fasta / -from-kmers - see build mode for details.
For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
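One common way to realize a filter that accepts a given fraction of k-mers is to hash each k-mer and keep it only when the hash falls below a threshold proportional to the fraction. The sketch below illustrates that idea under stated assumptions: the hash function (BLAKE2b from Python's hashlib), the helper names, and the sequence are stand-ins chosen for illustration, and nothing here reflects the hash function or the .minhash file format actually used by Kmer-db.

```python
import hashlib

def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def accept(kmer, fraction):
    """Keep a k-mer when its 64-bit hash falls below fraction * 2**64."""
    # Stand-in hash (assumption): 8-byte BLAKE2b digest read as an integer.
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    value = int.from_bytes(digest, "big")
    return value < fraction * 2**64


seq = "ACGTACGTTGCAACGTGGCATTACGATCGATCGGCTA"  # made-up sequence
all_kmers = set(kmers(seq, 5))
kept = {km for km in all_kmers if accept(km, 0.5)}
print(len(kept), "of", len(all_kmers), "distinct 5-mers kept")
```

Since the decision depends only on the k-mer itself, the same k-mer is accepted or rejected consistently in every sample, which keeps the filtered sets comparable.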
Let pathogens.list be the file containing names of samples (a .gz or .fasta genome file exists for each sample):
acinetobacter
klebsiella
e.coli
...
Calculating similarities/distances between all samples listed in pathogens.list using all 20-mers:
kmer-db build -k 20 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv
Same as above, but using only 10% of 20-mers:
kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db all2all pathogens.db matrix.csv
kmer-db distance matrix.csv
Calculating similarities/distances between samples listed in pathogens.list and salmonella using all 20-mers:
kmer-db build -k 20 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv
Same as above, but using only 10% of 20-mers:
kmer-db build -k 20 -f 0.1 pathogens.list pathogens.db
kmer-db one2all pathogens.db salmonella vector.csv
kmer-db distance vector.csv
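When these steps are driven from a script, the commands can be assembled programmatically. The following is a minimal sketch that builds the argument lists for the build, all2all, and distance steps from the examples above; the helper name pipeline_commands is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

def pipeline_commands(sample_list, database, table, k=20, fraction=None):
    """Return the build, all2all and distance commands as argument lists."""
    build = ["kmer-db", "build", "-k", str(k)]
    if fraction is not None:
        # -f enables minhashing during database construction.
        build += ["-f", str(fraction)]
    build += [sample_list, database]
    return [
        build,
        ["kmer-db", "all2all", database, table],
        ["kmer-db", "distance", table],
    ]


for cmd in pipeline_commands("pathogens.list", "pathogens.db", "matrix.csv", fraction=0.1):
    # To execute a step, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```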
The list of the pathogens investigated in the Kmer-db study can be found here.