Skip to content
Mikkel Winther Pedersen edited this page Oct 1, 2023 · 13 revisions

Taxonomic classification or profiling

activate conda environment

conda activate XXXX

ngsLCA which we will use to classify all reads to a taxonomic level assumes that the bam/sam files are sorted by readID. This can be done with the samtools suite more specifically the samtools sort command.

samtools sort -n $file -@ 5 > test.sort.bam

Remember you can use the view function in samtools to view the bam/sam file. try to specify the -h and -H options. Notice that samtools log the commands in the bottom of the -H (print header only) option. Also, explore a bit the bam file format, and see if you can recognize any of the information in there. You can also try out the -c option which counts the number of alignments in your bamfile.

ngsLCA a naive least common ancestor algorithm

The ngsLCA (next generation sequence Lowest Common Ancestor algorithm) toolkit includes two modules, ngsLCA main program and ngsLCA R package. The ngsLCA main program provides a fast and flexible taxonomic classification of DNA reads aligned to reference databases containing multiple organisms. The classification builds upon the NCBI taxonomy and performs a naïve lowest common ancestor assignment for reads with multiple alignments against different references. It is a command line tool that outputs a text file in "lca" format. Normally we would download the NCBI taxonomy dump files simultaneously with the databases e.g. RefSeq https://www.ncbi.nlm.nih.gov/refseq/ and nt to ensure overlap between versions. You can explore this after the course through guide https://github.com/miwipe/ngsLCA. In this tutorial we will use a subset of this database combined with a subset of microbial genomes from GTDB https://gtdb.ecogenomic.org/ that has been made compatible with the ncbi taxonomy. It is possible to use a set of custom reference genomes, but one needs to ensure that the accession number in the fasta header of the reference is linking to a taxid and so forth. and that all confines to the NCBI taxonomy format.

Now lets call on ngsLCA

ngsLCA

this should results in ngsLCA print the version number and the command with options. Like:

-> ngslca version: 0.9 (htslib: 1.17) build(May 18 2023 23:42:00) -> ./ngsLCA -names -nodes -acc2tax [-editdist[min/max] -simscore[low/high] -minmapq -discard] -bam

OBS., Please note that the similarity can be set as an edit distance [-editdist[min/max]], i.e., number of mismatches between the read to reference genome, or as a similarity distance [-simscore[low/high]], i.e., percentage of mismatches between the read to reference genome. Edit distance can be a number between 0-10, while the similarity score is a number between 0-1.

Please discuss in your group what could be a good set of options and note down why you have chosen these. Hint: the mapping quality of the

ngsLCA -names XXX -nodes XXX -acc2tax XXX [-editdist[min/max] / -simscore[low/high] -minmapq -discard] -bam test.sort.bam

Data wrangling

At this point the data analysis moves more and more towards data wrangling (handling) in R.

The ngsLCA R package provides functionality for processing "lca" files and illustrating the taxonomic profiles. It supplies R functions for quick transformation of the "lca" files to tables in different formats after filtering, e.g., a regular tab-separated table and MEGAN compatible formats, for de-contamination if laboratory controls have been sequenced and are provided, and for splitting taxa into different kingdoms (or user-defined taxonomic groups) and taxonomic ranks. Functions are also provided for outputting heatmaps, barplots and stratplots as well as NMDS and rarefaction analysis for a quick overview of the dataset.

Therefore activate your R conda environment

conda activate r

and open the r prompt by typing and hitting enter

R

And define your work directory and name of the run:

library(ngsLCA)
working_directory="./" #change it to the path of your lca files folder.
run_name="run01"

To generate a complete taxa profile by combining all input lca files (this step is required for running all other functions). It will generate a tab separated text file (“complete_profile.txt”) into the folder named after “run_name” in the working directory, in which the first 3 columns are NCBI taxaID, taxaname, and taxa rank. Each of the following columns shows the reads abundance of an input .lca file. You can type ?ngsLCA_profile to enter the help page for ngsLCA_profile to access an example sample metadata file.

ngsLCA_profile(path=working_directory,
               run=run_name,
               metadata="Desktop/project_files/sample_metadata.txt") #change it to the full path of your sample metadata.