Home
### RaPDTool is an experimental tool to carry out various bioinformatic tasks in a simple way, namely:
**Taxonomic profiling**
**Metagenomic binning**
**Estimation of Completeness and Redundancy**
**Estimation of taxonomic neighborhoods on individual genomic composites**
All you need to begin with RaPDTool is a metagenomic assembly and a reference database (the -i and -d options, respectively).
The reference database can be downloaded from:
NCBI Prokaryotic type material genomes with standing in nomenclature: https://figshare.com/ndownloader/files/30851626
GTDB r202: https://figshare.com/ndownloader/files/30863182
Choose one of these databases depending on your goals. The NCBI type material database contains a more curated but less diverse taxonomic corpus. GTDB proposes an alternative taxonomy that is especially interesting for new genomes.
Make sure you have all dependencies installed and functional. The best way to do this is inside a conda environment:
$ conda install focus metabat2 binning_refiner mash
$ pip install micomplete
Alternatively, install the dependencies individually; each one is documented on its own website:
FOCUS (https://github.com/metageni/FOCUS)
Metabat2 (https://bitbucket.org/berkeleylab/metabat/src/master/) (version tested 2:2.15)
Binning_refiner (https://github.com/songweizhi/Binning_refiner)
miComplete (https://github.com/EricHugo/miComplete)
Mash (https://github.com/marbl/Mash)
Clone this repository to the directory of your choice:
$ git clone https://github.com/ayixon/RaPDTool.git
Copy the input files (assembly and database) to the RaPDTool/ directory, move into it with the cd command, and run:
./rapdtool.py -i INPUT.fasta -d DATABASE.msh -r OUTPUT_FOLDER
The first time the software is executed, some R and binning_refiner dependencies are installed, which may take a while. This only happens once.
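When several assemblies have to be processed, the command above can be assembled programmatically. The following is only a minimal sketch: the helper names `build_rapdtool_command` and `missing_inputs` are hypothetical and not part of RaPDTool, and only the `-i`, `-d` and `-r` options shown above are used.

```python
import shlex
from pathlib import Path


def build_rapdtool_command(assembly, database, out_dir, script="./rapdtool.py"):
    """Return the argument list for one RaPDTool run (-i, -d and -r only)."""
    return [script, "-i", str(assembly), "-d", str(database), "-r", str(out_dir)]


def missing_inputs(assembly, database):
    """Return the input paths that do not exist as files."""
    return [str(p) for p in (Path(assembly), Path(database)) if not p.is_file()]


# Placeholder file names taken from the command line shown above.
cmd = build_rapdtool_command("INPUT.fasta", "DATABASE.msh", "OUTPUT_FOLDER")
print(shlex.join(cmd))

absent = missing_inputs("INPUT.fasta", "DATABASE.msh")
if absent:
    print("Missing input files:", ", ".join(absent))
# To launch the run from Python: subprocess.run(cmd, check=True)
```

Checking the inputs before launching avoids waiting on the pipeline only to discover a mistyped path.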
RaPDTool writes its output to 8 main directories:
Contains the log file of the RaPDTool execution (logfmbm.txt).
fmbm is an acronym for the main operations of the pipeline (FOCUS/Metabat2/Binning_refiner/Mash).
Contains the reference database used for running RaPDTool.
Contains the assembly used for running RaPDTool.
Stores the FOCUS taxonomic profile inferred from the input (metagenome assembly). You should see several files in tabular format (csv) reporting relative abundance from Kingdom to Species. FOCUS also ventures to infer Strains, but I would be cautious at that taxonomic level.
1- We can assume that the short reads contain a "genomic space" more representative of the community than the one contained in the assembly; assembly in itself implies a loss of taxonomic information. Profiling assembled contigs only approximates the taxonomic composition at the genomic level, so be cautious with the interpretations.
2- The native FOCUS database plays an important role in the accuracy of the profile. The initial release of FOCUS considered 2,766 reference genomes to build a k-mer frequency database (k = 6; k = 7). For the implementation of RaPDTool, we considered 14,551 genomes from the Type Material to give taxonomic certainty to the profiles while enriching the initial database. The new k = 6 and k = 7 k-mer files for updating the FOCUS database will be available at: https://drive.google.com/uc?export=download&id=1AOOwhmhg9Zn5iYrOs9j36cBZZTIupPbC
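To make the idea of a k-mer frequency profile concrete, here is a minimal sketch of how relative k-mer frequencies can be computed for a single sequence. It is an illustration only and not the FOCUS implementation; the function name `kmer_frequencies` is hypothetical, and details such as the handling of ambiguous bases or reverse complements are assumptions.

```python
from collections import Counter

DNA = set("ACGT")


def kmer_frequencies(sequence, k):
    """Return the relative frequency of each k-mer made only of A, C, G, T."""
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA  # skip windows containing N or other symbols
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence; a real profile would aggregate all contigs of a genome.
for k in (6, 7):
    profile = kmer_frequencies("ACGTACGTNACGTACGT", k)
    print(k, len(profile), "distinct k-mers")
```

A database built this way stores one such frequency vector per reference genome for each value of k.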
Contains several relevant subdirectories and files:
binmetabat/ > Stores the Metabat2 binning results: the genome composites aggregated from the initial metagenomic assembly.
outbinningref/ > Binning_refiner results. All bins obtained with Metabat2 are "refined" with Binning_refiner to produce a set of probable MAGs.
outmicomplete/ > miComplete results. Hugoson et al. (2020) published a fairly "generous" alternative for estimating the quality of assembled microbial genomes (https://doi.org/10.1093/bioinformatics/btz664). Although the gold standard is still CheckM, miComplete is more resource-friendly and offers a weighted calculation.
The result of miComplete is a table with the quality assessment of the refined bins as shown in the image:
outmash/ > Full Mash dist comparison of each bin against the input database. Remember that these databases contain a set of genomes curated as Mash representations, or sketches. This means that each bin (for example bin1) is compared against the ~17,000 records in the database (which is extremely fast with Mash), and the result is a table with 5 columns:
Query_genome | Match_in_database | Genomic_Distance | p_value | Shared_Hashes |
---|---|---|---|---|
Bin1.fna | GCA_Reference.fna | 0.0327655 | 0 | 471/1000 |
The genomic distance in the third column is the Mash distance, also described as a mutation distance. You will find more information on the interpretation of these tables in: https://doi.org/10.1186/s13059-016-0997-x. As a practical rule, if two genomes share a distance < 0.05, they are likely to be genomically coherent, which has implications for the prokaryotic species concept. It also means that the database entries with the smallest genomic distances are potentially the closest phylogenetic neighbors of your query, which is very useful if you want to explore a phylogenetic hypothesis.
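The < 0.05 rule can be applied to such a table with a few lines of Python. This is a minimal sketch that assumes the table is stored as tab-separated text with the five columns in the order shown above; the names `MashHit`, `parse_mash_table` and `coherent_hits` are hypothetical, and the second example row is invented for illustration.

```python
from typing import NamedTuple


class MashHit(NamedTuple):
    query: str
    match: str
    distance: float
    p_value: float
    shared_hashes: str


def parse_mash_table(lines):
    """Parse tab-separated rows: query, match, distance, p-value, shared hashes."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        query, match, distance, p_value, shared = fields
        try:
            hits.append(MashHit(query, match, float(distance), float(p_value), shared))
        except ValueError:
            continue  # skip a header row or non-numeric values
    return hits


def coherent_hits(hits, threshold=0.05):
    """Keep the hits whose distance is strictly below the threshold."""
    return [h for h in hits if h.distance < threshold]


rows = [
    "Bin1.fna\tGCA_Reference.fna\t0.0327655\t0\t471/1000",    # row from the table above
    "Bin1.fna\tGCA_Hypothetical.fna\t0.21\t1e-10\t20/1000",   # invented distant hit
]
for hit in coherent_hits(parse_mash_table(rows)):
    print(hit.query, hit.match, hit.distance)
```

The threshold is a parameter because 0.05 is a practical guideline rather than a hard boundary.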
Other subdirectories contain the log files of each task.
Contains the ten closest hits from the Mash pairwise comparison for each genome. This simplifies the interpretation of the results by limiting the Mash comparison to the ten closest neighbors of the query, which can be useful in phylogenetics and taxonomy. The user can take this list as the basis for a finer comparison by estimating an Overall Genome Relatedness Index (OGRI) such as ANI...
As you can see, the hits are conveniently sorted from smallest to largest distance, so it is easy to establish or rule out probable genomic coherence and to use the reference genomes in subsequent, more refined analyses.
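The selection of the closest neighbors can be reproduced on any full distance table. The sketch below is not the code RaPDTool runs; it assumes the rows are already available as (query, match, distance) tuples, the function name `closest_hits` is hypothetical, and the third example row is invented.

```python
from collections import defaultdict


def closest_hits(rows, n=10):
    """Group (query, match, distance) rows by query and keep the n smallest distances."""
    by_query = defaultdict(list)
    for query, match, distance in rows:
        by_query[query].append((distance, match))
    return {
        query: [(match, distance) for distance, match in sorted(hits)[:n]]
        for query, hits in by_query.items()
    }


rows = [
    ("feces_assembly_1.fasta", "GCF_002222595.2", 0.095),
    ("feces_assembly_1.fasta", "GCF_hypothetical_far", 0.30),  # invented distant hit
    ("feces_assembly_1.fasta", "GCF_003287895.1", 0.075),
]
for match, distance in closest_hits(rows, n=2)["feces_assembly_1.fasta"]:
    print(match, distance)
```

Sorting in ascending order puts the most probable neighbors first, which matches the layout of the reported list.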
For example, in the previous image the bin feces_assembly_1.fasta shares a genomic distance of ~0.075 with the assembly GCF_003287895.1, which belongs to the species Blautia argi (Firmicutes), and ~0.095 with the assembly GCF_002222595.2, which belongs to the species Blautia hansenii. Other hits in this comparison also match members of the genus Blautia. It is therefore reasonable to hypothesize that the bin feces_assembly_1.fasta is related to the Blautia clade (probably at the genus level, although nothing can be said about the species yet). So, presumably, feces_assembly_1.fasta can be classified as Blautia sp.
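This kind of reasoning can be made explicit by counting how many of the closest hits fall in each genus. The sketch below assumes you have already looked up the species name of each hit (RaPDTool is not claimed to provide this mapping); the function name `genus_support` is hypothetical.

```python
from collections import Counter


def genus_support(hit_species):
    """Count hits per genus, taking the first word of each species name as the genus."""
    genera = Counter(name.split()[0] for name in hit_species if name.strip())
    return genera.most_common()


# Species of the two hits discussed above.
species_of_hits = ["Blautia argi", "Blautia hansenii"]
for genus, count in genus_support(species_of_hits):
    print(genus, count)
```

A single dominant genus among the closest hits supports a genus-level hypothesis only; it says nothing about the species.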
Possible follow-up tests include estimating the Average Nucleotide Identity against these close hits and reconstructing a phylogenomic tree to place the query in a finer taxonomic context.