Skip to content

Available features

Fabio Cumbo edited this page May 1, 2023 · 3 revisions

The MetaSBT framework provides a set of subroutines (here called modules) for (i) building a taxonomically-organized database of microbial genomes, (ii) defining boundaries for each of the clusters in the database at different taxonomic levels to explore and define the intra-cluster genetic diversity, (iii) profiling genomes and metagenome-assembled genomes by querying the database and reporting the closest lineage and reference genome, (iv) updating a database with new reference genomes and metagenome-assembled genomes by also defining new and yet-to-be-named clusters, (v) and finally producing a report of the current status of a database.

It follows a detailed explanation of all the available modules.

1. index: building a reference database

The index subroutine allows organise a set of reference genomes from isolate sequencing in folders that reflect their taxonomic classification. It makes use of howdesbt to rapidly index all the genomes at the species level and create a sequence bloom tree for each of the species. Lower taxonomic levels are indexed by building new sequence bloom trees considering only the root nodes of the trees at the upper taxonomic levels.

The following command will trigger the generation of a MetaSBT database with a set of genomes listed in a input table called genomes.tsv:

metasbt index --db-dir ~/myindex \
              --input-list ~/genomes.tsv \
              --extension fna.gz \
              --estimate-filter-size \
              --increase-filter-size 5.0 \
              --estimate-kmer-size \
              --min-kmer-occurrences 2 \
              --dereplicate \
              --similarity 100.0 \
              --completeness 50.0 \
              --contamination 5.0 \
              --parallel 4 \
              --nproc 2 \
              --pplacer-threads 2 \
              --jellyfish-threads 2 \
              --tmp-dir ~/tmp \
              --log ~/index.log
              --cleanup \
              --verbose

Please note that the paths to the input genomes must be listed on the first column of the file specified under --input-list, while the second column must contain the genomes full taxonomic label. Fields under the two columns must be separated by a tab like in the example below:

~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna.gz  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_020554925.1_ASM2055492v1_genomic.fna.gz  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_000450545.2_ASM45054v2_genomic.fna.gz    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000450565.2_ASM45056v2_genomic.fna.gz    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000014845.1_ASM1484v1_genomic.fna.gz     k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli

Note: you may want to get rid of the taxonomic organization of genomes by specifying the --flat-structure option that will consider all the input genomes together for the generation of a single sequence bloom tree. Please note that the --flat-structure option is not compatible with the update module for updating the database with new genomes. Thus, in case you will need to update the sequence bloom tree, there are no other options than building the database from scratch with the new set of genomes.

Available options

Option Default Mandatory Description
--cluster-prefix MSBT Prefix of clusters numerical identifiers
--completeness 0.0 Minimum completeness percentage allowed for input genomes
--contamination 100.0 Maximum contamination percentage allowed for input genomes
--cleanup False Remove temporary data at the end of the pipeline
--closely-related False For closely related genomes use this flag
--db-dir Database folder path
--dereplicate False Dereplicate input genomes
--estimate-filter-size False Estimate the bloom filter size with ntCard. It automatically overrides the --filter-size option
--estimate-kmer-size False Estimate the optimal kmer size with kitsune. It automatically overrides the --kmer-len option
--extension Specify the input genome files extension. All the input genomes must have the same file extension before running this module. Supported file extension: fa, fa.gz, fasta, fasta.gz, fna, fna.gz
--filter-size Bloom filter size
--flat-structure Organize genomes without any taxonomic organization. This will lead to the creation of a single sequence bloom tree
--help Print the list of arguments and exit
--increase-filter-size 0.0 Increase the estimated filter size by the specified percentage. This is used in conjunction with the --estimate-filter-size argument only. It is highly recommended to increase the filter size by a good percentage in case you are planning to update the index with new genomes
--input-list Path to the input table with a list of genome file paths and an optional column with their taxonomic labels. Please note that the input genome files must be gz compressed with fna extension (i.e.: *.fna.gz)
--jellyfish-threads 1 Maximum number of threads for Jellyfish. This is required to maximise the kitsune performances
--kmer-len This is the length of the kmers used for building bloom filters
--limit-estimation-number Limit the number of genomes per group to be considered as input for kitsune and ntCard. It overrides --limit-estimation-percentage in case of a number > 0
--limit-estimation-percentage 100.0 Percentage on the total number of genomes per group to be considered as input for kitsune and ntCard
--limit-genomes Inf Limit the number of genomes per species. This will remove the exceeding number of genomes randomly to cut the overall number of genomes per species to this number
--limit-kmer-size 32 Limit the estimation of the optimal kmer size with kitsune to this value at most
--log Path to the log file
--max-genomes Inf Consider species with this number of genomes at most
--min-genomes 1 Consider species with a minimum number of genomes
--min-kmer-occurrences 2 Minimum number of occurrences of kmers to be considered for estimating the bloom filter size and for building the bloom filter files
--nproc 1 This argument refers to the number of processors used for parallelizing the pipeline when possible
--parallel 1 Maximum number of processors to process each NCBI tax ID in parallel
--pplacer-threads 1 Maximum number of threads for pplacer. This is required to maximise the CheckM performances
--similarity 100.0 Dereplicate genomes if they have a percentage of common kmers greater than or equals to the specified one. This is used exclusively in conjunction with the --dereplicate argument
--tmp-dir Path to the temporary folder
--verbose False Print results on screen
--version Print current module version and exit

2. boundaries: defining clusters boundaries

The boundaries module is crucial for the definition of taxonomy-specific boundaries. It explots howdesbt to build a table with the minimum and maximum amount of kmers in common between all the genomes in a particular taxonomic level. These boundaries are then used by the update module in order to establish whether a new genome must be assigned to the closest cluster identified by the profile module.

The following command will trigger the definition of the kmer boundaries for each taxonomic level in the database:

metasbt boundaries --db-dir ~/myindex \
                   --min-genomes 50 \
                   --output ~/boundaries.tsv \
                   --tmp-dir ~/tmp \
                   --nproc 4 \
                   --cleanup \
                   --verbose

Note: the boundaries module considers clusters with reference genomes only. These clusters can be considered for establishing boundaries depending on a minimum number of reference genomes that can be set with the --min-genomes argument.

Available options

Option Default Mandatory Description
--cleanup False Remove temporary data at the end of the pipeline
--db-dir Database directory with the taxonomically organised sequence bloom trees
--flat-structure False Genomes in the database have been organized without a taxonomic structure
--help Print the list of arguments and exit
--superkingdom Consider genomes whose lineage belongs to a specific superkingdom
--log Path to the log file
--max-genomes Maximum number of genomes per cluster to be considered for computing boundaries. Genomes are selected randomly in case the size of clusters is greater than this number. This must be greater than or equals to --min-genomes
--min-genomes 3 Consider clusters with a minimum number of genomes only
--nproc 1 This argument refers to the number of processors used for parallelizing the pipeline when possible
--output Output file with kmer boundaries for each of the taxonomic labels in the database
--tmp-dir Path to the temporary folder
--verbose False Print results on screen
--version Print current module version and exit

3. profile: characterizing genomes and metagenome-assembled genomes

The profile module allows to characterize an input genome according to the closest lineage in the database. It allows to process only one genome in input at a time:

metasbt profile --input-file ~/mymag.fna \
                --input-id mymag \
                --input-type genome \
                --tree ~/myindex/k__Bacteria/index.detbrief.sbt \
                --threshold 0.7 \
                --expand \
                --stop-at family \
                --output-dir ~/profiles \
                --output-prefix mymag \
                --verbose

Note: with the example proposed above, we explicitly set the --stop-at argument to family. This argument works in conjunction with the --expand option only, and it will prevent epanding the query to all the taxonomic levels lower than the specified one. Also note that the --expand argument expands the input query up to the species level by default, by also reporting the closest genome, without the need to use the --stop-at argument.

Available options

Option Default Mandatory Description
--expand False Expand the input query on all the taxonomic levels
--input-file Path to the input query
--input-id Unique identifier of the input query
--input-type Accepted input types are genomes or files with one nucleotide sequence per line. In case of genomes, if they contain multiple sequences, results are merged together
--log Path to the log file
--output-dir Output folders with queries results
--output-prefix Prefix of the output files with query matches
--stop-at Stop expanding queries at a specific taxonomic level. Please note that this argument works in conjunction with --expand only
--threshold 0.0 Fraction of query kmers that must be present in a leaf to be considered a match
--tree This is the tree definition file
--verbose False Print results on screen
--version Print current module version and exit

4. update: updating the database

This module can be used to add new reference genomes and metagenome-assembled genomes (MAGs) to the database generated with the index module.

In case of new MAGs, it first try to profile them by comparing the input genomes with those present in the database. An input genome is assigned to the closest genome in the database if their set of kmers result similar enough. In case the profiler will not be able to characterize the input genome, the update subroutine will exploit howdesbt to automatically group the input genomes according to the taxonomy-specific boundaries identified with the boundaries utility, and it finally generate new clusters of potentially novel and yet-to-be-named species.

In case of new reference genomes from isolate sequencing, the update module simply add the new genomes to the corresponding species by rebuilding the species trees and all the trees at the lower taxonomic levels (please note that the taxonomic labels of the input reference genomes are known). If the taxonomy of an input reference genome is not in the database, the module will compare the input genome with all the cluster of species with no references before generating a new branch in the index database. If the input genome results very close to a cluster of unknown genomes, its taxonomic label will be inherited by all the genomes in the unknown cluster.

The update module can be launched with the following command:

metasbt update --input-list ~/mygenomes.txt \
               --taxa ~/taxonomies.tsv \
               --boundaries ~/boundaries.tsv \
               --db-dir ~/myindex \
               --tmp-dir ~/tmp \
               --dereplicate \
               --similarity 100.0 \
               --completeness 50.0 \
               --contamination 5.0 \
               --type references \
               --extension fna.gz \
               --parallel 4 \
               --nproc 2 \
               --pplacer-threads 2 \
               --cleanup \
               --verbose

Please note that MetaSBT requires that all your input genomes have the same format and extension before running the pipeline. You can easily uniform your genome files extension by running the uniform_inputs.sh script under the scripts folder of this repository:

sh ./scripts/uniform_inputs.sh ~/mygenomes fa fna

The first argument is the path to the folder with the set of input genome files, while the second and third arguments are the current and new file extensions respectively. In this particular case, all the *.fa files are converted to *.fna files.

In case of files with multiple extensions, please run the same script as many times as you need to uniform all the input genome files extensions, until the number of files with the new extension matches the actual number of files in the input folder:

sh ./scripts/uniform_inputs.sh ~/mygenomes fa fna
sh ./scripts/uniform_inputs.sh ~/mygenomes fasta fna
sh ./scripts/uniform_inputs.sh ~/mygenomes fa.gz fna.gz

Available options

Option Default Mandatory Description
--boundaries Path to the output table produced by the boundaries module
--boundary-uncertainty 0.0 Define the percentage of kmers to enlarge and reduce boundaries
--cluster-prefix MSBT Prefix of clusters numerical identifiers
--completeness 0.0 Minimum completeness percentage allowed for input genomes
--contamination 100.0 Maximum contamination percentage allowed for input genomes
--cleanup False Remove temporary data at the end of the pipeline
--db-dir Database folder path
--dereplicate False Dereplicate input genomes
--extension Specify the input genome files extension. All the input genomes must have the same file extension before running this module
--input-list This file contains the list of paths to the new genomes that will be added to the database
--log Path to the log file
--nproc 1 This argument refers to the number of processors used for parallelizing the pipeline when possible
--parallel 1 Maximum number of processors to process each NCBI tax ID in parallel
--pplacer-threads 1 Maximum number of threads for pplacer. This is required to maximise the CheckM performances
--similarity 100.0 Dereplicate genomes if they have a percentage of common kmers greater than or equals to the specified one. This is used exclusively in conjunction with the --dereplicate argument
--taxa Input file with the mapping between input genome IDs and their taxonomic label. This is used in case of reference genomes only
--tmp-dir Path to the temporary folder
--type Define the nature of the input genomes
--verbose False Print results on screen
--version Print current module version and exit

5. report: building the database snapshot report

Once the database is built and updated with new MAGs and reference genomes, you can easily extract relevant information about all the species in your database by running the following command:

metasbt report --db-dir ~/myindex \
               --output-file ~/report.tsv

The output file is a table that will contain the number of MAGs and reference genomes, in addition to the mean completeness, contamination, and strain heterogeneity percentages for each lineage in the database. Please note that lineages with no reference genomes correspond to newly defined clusters and potentially new and still-to-be-named species.

Available options

Option Mandatory Description
--db-dir Database folder path
--output-file Path to the output table
--version Print current module version and exit