Skip to content
Mikkel Winther Pedersen edited this page Oct 3, 2023 · 13 revisions

Taxonomic classification and profiling

activate conda environment and go to mapping folder

conda activate day1

cd ~/course/wdir/mapping

We can now explore the alignment files with samtools. Remember you can use the view function in explore the bam/sam file. try to specify the -h and -H options. Notice that samtools log commands at the bottom of the -H (print header only) option. Also, explore a bit the bam file format, and see if you can recognize any of the information in there? You can also try out the -c option which counts the number of alignments in your bamfile.

samtools view PRI-TJPGK-CATN-96-98.bam | head -40

We can now parse all our alignments for each read to a taxonomic classifier in this case we use ngsLCA. But before this can be done ngsLCA assumes that the bam/sam files are sorted by readID. This can be done with the samtools suite more specifically the samtools sort command with the option -n.

samtools sort -n PRI-TJPGK-CATN-96-98.bam -@ 5 > PRI-TJPGK-CATN-96-98.sort.bam

ngsLCA a naive least common ancestor algorithm

The ngsLCA (next generation sequence Lowest Common Ancestor algorithm) toolkit includes two modules, ngsLCA main program and ngsLCA R package. The ngsLCA main program provides a fast and flexible taxonomic classification of DNA reads aligned to reference databases containing multiple organisms. The classification builds upon the NCBI taxonomy and performs a naïve lowest common ancestor assignment for reads with multiple alignments against different references. The command line tool outputs a text file in "lca" format. Normally we would download the NCBI taxonomy dump files simultaneously with the databases e.g. RefSeq https://www.ncbi.nlm.nih.gov/refseq/ and nt to ensure overlap between versions. You can explore this after the course through guide https://github.com/miwipe/ngsLCA. In this tutorial we will use a subset of this database combined with a subset of microbial genomes from GTDB https://gtdb.ecogenomic.org/ that has been made compatible with the ncbi taxonomy. It is possible to use a set of custom reference genomes, but one needs to ensure that the accession number in the fasta header of the reference is linking to a taxid and so forth. and that all confines to the NCBI taxonomy format.

conda activate day2

Now let's call on ngsLCA

ngsLCA

this should results in ngsLCA print the version number and the command with options. Like:

-> ngslca version: 0.9 (htslib: 1.17) build(May 18 2023 23:42:00) -> ./ngsLCA -names -nodes -acc2tax [-editdist[min/max] -simscore[low/high] -minmapq -discard] -bam

OBS., Please note that the similarity can be set as an edit distance [-editdist[min/max]], i.e., number of mismatches between the read to reference genome, or as a similarity distance [-simscore[low/high]], i.e., percentage of mismatches between the read to reference genome. Edit distance can be a number between 0-10, while the similarity score is a number between 0-1.

Please discuss in your group what could be a good set of options and note down why you have chosen these. Hint: the mapping quality of the

The ngsLCA algorithm has been implemented and is now continued inside another program we call metaDMG, so lets go to the environment

conda activate metaDMG
metaDMG-cpp lca

Now this should print near to identical command options as the ngsLCA command. If this works let's set metaDMG lca to run.

metaDMG-cpp lca -bam PRI-TJPGK-CATN-96-98.sort.bam -names ~/course/data/shared/mapping/taxonomy/names.dmp -nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp -acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -weighttype 1 -fix_ncbi 0 -out PRI-TJPGK-CATN-96-98

Now make a list sorted by time and reversed to print the list of latest files created in the bottom.

ls -ltrh

If ngsCLA has finished its run you will have a .lca and .log (and .bin which is a binary look up file) the two first files are the per-read taxonomic classification and the log which outputs error messages if any. The PRI-TJPGK-CATN-96-98.bdamage.gz and the PRI-TJPGK-CATN-96-98.stat are files created for the DNA damage estimation, and are not part of the ngsLCA output, but are generated during the metaDMG implementation of ngsLCA as we were going through the bam file anyways. So let's look at the log first.

more PRI-TJPGK-CATN-96-98.log 

If there were issues with finding accession numbers, from the reference genomes, that would be printed as an error here. And the read and its alignments will be ignored.

Now lets have a look at your taxonomic profile for the sample, but zcat'ing the lca output.

zcat PRI-TJPGK-CATN-96-98.lca.gz | head -40

This output contains per-read classification and each line is a unique read with its own classification. You can have a look at the .lca file format here. Now this is simple text manipulation that will allow you to parse the profile and to further plot this in R or however you usually visualize your data.

Try cutting out the species names

zcat PRI-TJPGK-CATN-96-98.lca.gz | cut -f 14 -d: | head -40

Try cutting out the taxonomic names of lowest taxonomic level each read has been classified to, sort them and count how many reads was classified to each.

zcat PRI-TJPGK-CATN-96-98.lca.gz | cut -f 14 -d: | sort | uniq -c

Do you know any of these? Let's check the rose family "Rosaceae", using grep.

zcat PRI-TJPGK-CATN-96-98.lca.gz | grep 'Rosaceae' | head -10

Note, that the reads here are classified at different taxonomic levels. We can also grep out the Ursidae family.

zcat PRI-TJPGK-CATN-96-98.lca.gz | grep 'Ursidae' | cut -f 14 -d: | sort | uniq -c

Do you know this family? If not perhaps try and google it? You can also use other resources like gbif.org.

Now we don't know if all these taxa are ancient or not, just that the reads are unique to the given level, within the similarity thresholds set. But to do the mapDamage exercise for each of these 505 species in this sample (and that is a low count compared to what you can classify in a given sample) would take a lot of time, first finding, downloading, indexing, mapping the uniquely classified reads to each corresponding genomes will take a looooooooong time. And then you will need to do this for all your samples, just to filter away taxa that show now damage, DNA fragmentation or have too few reads for this ancient DNA authentication.

Plotting general statistics on the nature of the data from the lca output in R.

Activate your R environment, OBS! if you want to be able to view the output you should be using the IDE and not the terminal only.

conda activate r 

First we create and add header to a new file

echo "ID,Readlength,Count" > long_table.csv

Then we can loop over the lca output and plot the readlength distribution using ggplot2.

for file in *lca.gz
do
    # Extract the ID from the input filename
    id=$(basename "$file" .lca.gz)

    # Sort and count the unique numbers in the file, skip first line, and append to long_table.csv
    zcat "$file" | cut -d':' -f9 | sort | uniq -c | sed 1d | awk -v id="$id" '{print id "," $2 "," $1}' >> long_table.csv

    # Generate a scatterplot of the data using ggplot2
    Rscript -e "library(ggplot2); data <- read.csv('long_table.csv'); myplot <- ggplot(data, aes(x=Readlength, y=Count)) + geom_point() + facet_wrap(~ID, ncol = 2, scales = 'free_y'); ggsave('readlength_distributionTotal.pdf', myplot, width = 6, height = 4)"
done

This can also be done per superkingdom or which taxonomic level you want. (OBS will take 5 minutes or so)

echo "ID,Category,Readlength,Count" > long_table2.csv

for file in *lca.gz
do
    # Extract the ID from the input filename
    id=$(basename "$file" .lca.gz)

    # Sort and count the unique numbers in the file, skip first line, and append to long_table.csv
    zcat "$file"  | grep -e 'Archaea' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Archaea" "," $2 "," $1 }' >> Archaea_long_table.csv
    zcat "$file"  | grep -e 'Bacteria' -e 'bacteria' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Bacteria" "," $2 "," $1 }' >> Bacteria_long_table.csv
    zcat "$file"  | grep -e 'Viridiplantae' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Viridiplantae" "," $2 "," $1 }' >> Viridiplantae_long_table.csv
    zcat "$file"  | grep -e 'Virus' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Virus" "," $2 "," $1 }' >> Virus_long_table.csv
    zcat "$file"  | grep -e 'Metazoa' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Metazoa" "," $2 "," $1 }' >> Metazoa_long_table.csv
    # Create a long table of all long tables
    cat Archaea_long_table.csv Bacteria_long_table.csv Viridiplantae_long_table.csv Virus_long_table.csv Metazoa_long_table.csv >> long_table2.csv

    # Generate a scatterplot of the data using ggplot2
    Rscript -e "library(ggplot2);library(viridisLite); data <- read.csv('long_table2.csv'); plot <- ggplot(data, aes(x = Readlength, y = Count, fill = Category)) + geom_area(alpha = 0.8) + scale_fill_viridis_d() + facet_wrap(~ID, scales = 'free_y', ncol = 2) + labs(x = 'Read Length', y = 'Count', title = 'Read Length Distribution by Category') + theme(plot.title = element_text(hjust = 0.5)); ggsave('readlength_distributionPerKingdom.pdf', plot, width = 5, height = 7)"
done

metaDMG

Now we will compute DNA damage and statistics using metaDMG.

conda activate metaDMG

Create configuration file = config.yaml which defines the samples and taxonomy used as well as the LCA and damage calculation settings (see https://github.com/miwipe/ngsLCA for details) OBS input files can be in .sam, .sam.gz and bam file formats (they are the same just different compressions). The taxonomy should be identical to the version of database you use, it can be NCBI or custom.

metaDMG config *.sort.bam --names ~/course/data/shared/mapping/taxonomy/names.dmp --nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp --acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -m /usr/local/bin/metaDMG-cpp

Edit the config file according to wishes, study and data. set custom database = true, set number of threads to 4.

vim config.yaml

Then compute the statistics

metaDMG compute config.yaml 

And now it is time to explore the data, first using the GUI

metaDMG dashboard config.yaml

You should now be prompted to open a new browser window, press ok to the TRUST message, and open. Then we go for an interactive exploration of the dataset.

Once done browsing the GUI, press control + c to exit

Now when datasets grow to a larger size and you want to handle it in R or like you can convert the output in the /data/results/ folder to a merged csv file.

metaDMG convert --output path2/results.csv --add-fit-predictions