
Taxonomic classification and profiling

activate conda environment and go to mapping folder

conda activate env1

cd ~/course/wdir/mapping

We can now explore the alignment files with samtools. Remember you can use the view function to explore the bam/sam file. Try specifying the -h and -H options; notice that samtools logs the commands that were run on the file at the bottom of the header printed by -H (print header only). Also explore the bam file format a bit and see if you can recognize any of the information in there. You can also try the -c option, which counts the number of alignments in your bam file.

samtools view PRI-TJPGK-CATN-26-28.bam | head -40

samtools view PRI-TJPGK-CATN-26-28.bam | head -40 | awk '{print $10}'

samtools view PRI-TJPGK-CATN-26-28.bam | head -40 | awk '{print $18}'

samtools view PRI-TJPGK-CATN-26-28.bam | head -40 | awk '{split($18, a, ":"); print a[3]}'

samtools view PRI-TJPGK-CATN-26-28.bam | head -40 | awk '{split($18, a, ":"); if (length($10) > 0) print ((length($10)-a[3]) / length($10)) * 100 }'

What do the above commands do?
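For the -H and -c options mentioned above, a couple of quick examples on the same bam file (look for the @PG lines in the header, which log the commands that produced the file):

samtools view -H PRI-TJPGK-CATN-26-28.bam   # print the header only
samtools view -c PRI-TJPGK-CATN-26-28.bam   # count the number of alignments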

Bam-filter - summarizing bam-filter statistics

One of the more recently developed tools is bam-filter. As the name indicates, it is designed to filter bam files based on different alignment statistics. But before we dig into it, let's have a look at the github repo and familiarize ourselves with its capabilities, especially the filter statistics it reports. (OBS: the reassign and lca options are still being tested, especially for eukaryotic data.)
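You can also browse the available subcommands and options directly from the command line. Note that the executable and subcommand names assumed here (filterBAM and its filter subcommand) may differ between versions of the bam-filter package, so adjust to what your install provides:

filterBAM --help
filterBAM filter --help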

Please discuss in your group: (i) what options were flagged in the bam-filter command above? Try to identify options that could be useful for obtaining a bam file in which the reads assigned to references have a higher likelihood of being true hits.

metaDMG (ngsDMG)

Once done with bam-filter, let's run metaDMG (ngsDMG) before we dig into the results of the bam-filter output. We run a loop over all files that takes each file through, first, the lca algorithm, then fits the damage model (dfit), and finally merges the two outputs (aggregate). We thus end up with, for each taxon, the number of uniquely classified reads and the associated statistics such as damage, GC content, read length etc. More on this later; we set this running so it can finish in the background while we take a look at the bam-filter stats file.

First attach to the tmux session from yesterday, or alternatively make a new one (see day1).

tmux attach

Run the metaDMG-cpp commands

conda activate metaDMG

while read -r line
do
    # the first column of sample.list is the sample name
    arr=($line)
    sample=${arr[0]}
    # per-read taxonomic classification (lca)
    metaDMG-cpp lca --names ~/course/data/day4/taxonomy/names.dmp --nodes ~/course/data/day4/taxonomy/nodes.dmp --acc2tax ~/course/data/day1/taxonomy/acc2taxid.map2.gz --sim_score_low 0.95 --sim_score_high 1.0 --how_many 30 --weight_type 1 --fix_ncbi 0 --threads 4 --bam $sample.filtered.bam --out_prefix $sample.filtered
    # fit the damage model to the mismatch matrices (dfit)
    metaDMG-cpp dfit $sample.filtered.bdamage.gz --threads 4 --names ~/course/data/day4/taxonomy/names.dmp --showfits 2 --nopt 10 --nbootstrap 20 --doboot 1 --seed 1234 --lib ds --out $sample
    # merge the lca statistics with the damage fits (aggregate)
    metaDMG-cpp aggregate $sample.filtered.bdamage.gz --nodes ~/course/data/day4/taxonomy/nodes.dmp --names ~/course/data/day4/taxonomy/names.dmp --lcastat $sample.filtered.stat.gz --dfit $sample.dfit.gz
done < sample.list

Detach from the session. Inside tmux, press:

Ctrl + b, d

Go out of the tmux session. Now we make one table with all results from bam-filter, with only one header:

for file in  *.stats.tsv.gz ;  do echo "$file"; zcat "$file" | tail -n +2 | awk -v filename="$file" '{print filename "\t" $0}' >> bam-filter_stats.tmp ;  done
zcat  PRI-TJPGK-CATN-26-28.stats.tsv.gz | head -1 > bam-filter_header.tsv
cat bam-filter_header.tsv bam-filter_stats.tmp > bam-filter_stats.tsv

Let's view the output

less -S bam-filter_stats.tsv

Using nano, vim or the like, add "sample" to the first line before "reference", separated by a tab.
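If you prefer a one-liner, something along these lines should also work (a sketch assuming GNU sed, which interprets \t as a tab in the replacement; check the result afterwards):

sed -i '1s/^/sample\t/' bam-filter_stats.tsv   # prepend "sample" + tab to the header line
head -1 bam-filter_stats.tsv                   # verify the new header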

Explore the key statistics using R, and make a few plots of the values you find interesting/important.
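As a minimal sketch of how this could look (the column names n_reads and read_ani_mean used here are assumptions for illustration; replace them with the actual column names you see in the header of bam-filter_stats.tsv):

Rscript -e "library(ggplot2); stats <- read.delim('bam-filter_stats.tsv', header=TRUE); p <- ggplot(stats, aes(x=n_reads, y=read_ani_mean)) + geom_point(alpha=0.5) + scale_x_log10() + labs(x='Number of reads (log10)', y='Mean read ANI'); ggsave('bam-filter_stats_scatter.pdf', p, width=8, height=6)"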

Let's fit the observations to the damage model in metaDMG (ngsDMG).

And let's aggregate (merge) the lca stats with the fitted stats.

Now let's merge the aggregated results from metaDMG and do some damage plots.

for file in  *.bdamage.gz.stat.gz ;  do echo "$file"; zcat "$file" | tail -n +2| awk -v filename="$file" '{print filename "\t" $0}' >> metadmg_data2.tsv ; done
zcat $(ls *.bdamage.gz.stat.gz | head -1) | head -1 > metadmg_header.tsv
cat metadmg_header.tsv metadmg_data2.tsv > metadmg_data_final.tsv

We can now pass all the alignments for each read to a taxonomic classifier, in this case ngsLCA. Before this can be done, ngsLCA assumes that the bam/sam files are sorted by readID. This is why we flagged the option -N in bam-filter, which sorts the filtered bam by readID.

ngsLCA, a naive lowest common ancestor algorithm

The ngsLCA (next generation sequence Lowest Common Ancestor algorithm) toolkit includes two modules, the ngsLCA main program and the ngsLCA R package. The ngsLCA main program provides a fast and flexible taxonomic classification of DNA reads aligned to reference databases containing multiple organisms. The classification builds upon the NCBI taxonomy and performs a naive lowest common ancestor assignment for reads with multiple alignments against different references. The command line tool outputs a text file in "lca" format. Normally we would download the NCBI taxonomy dump files at the same time as the databases, e.g. RefSeq https://www.ncbi.nlm.nih.gov/refseq/ and nt, to ensure overlap between versions. You can explore this after the course through the guide at https://github.com/miwipe/ngsLCA. In this tutorial we will use a subset of this database combined with a subset of microbial genomes from GTDB https://gtdb.ecogenomic.org/ that has been made compatible with the NCBI taxonomy. It is possible to use a set of custom reference genomes, but one needs to ensure that the accession number in the fasta header of each reference links to a taxid, and that everything conforms to the NCBI taxonomy format.

conda activate env2

Now let's call on ngsLCA

ngsLCA

This should result in ngsLCA printing the version number and the command with its options, like:

-> ngslca version: 0.9 (htslib: 1.17) build(May 18 2023 23:42:00) -> ./ngsLCA -names -nodes -acc2tax [-editdist[min/max] -simscore[low/high] -minmapq -discard] -bam

OBS: Please note that the similarity can be set either as an edit distance [-editdist[min/max]], i.e. the number of mismatches between the read and the reference genome, or as a similarity score [-simscore[low/high]], i.e. the percentage similarity between the read and the reference genome. The edit distance can be a number between 0 and 10, while the similarity score is a number between 0 and 1.
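Purely as an illustration (the similarity thresholds and the mapping-quality cut-off are placeholders for your group to decide on; the taxonomy paths are the same as used further below; and the output-prefix flag -outnames is taken from the ngsLCA documentation and may differ in your build), an ngsLCA call could look something like:

ngsLCA -simscorelow 0.95 -simscorehigh 1.0 -minmapq 25 -names ~/course/data/shared/mapping/taxonomy/names.dmp -nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp -acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -bam PRI-TJPGK-CATN-26-28.filtered.bam -outnames PRI-TJPGK-CATN-26-28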

Please discuss in your group what could be a good set of options and note down why you have chosen these. Hint: the mapping quality of the

The ngsLCA algorithm has been implemented in, and is now maintained inside, another program we call metaDMG, so let's switch to that environment:

conda activate metaDMG
metaDMG-cpp lca

This should print nearly identical command options to the ngsLCA command. If this works, let's set metaDMG lca to run.

metaDMG-cpp lca -bam PRI-TJPGK-CATN-26-28.filtered.bam -names ~/course/data/shared/mapping/taxonomy/names.dmp -nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp -acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -weighttype 1 -fix_ncbi 0 -out PRI-TJPGK-CATN-96-98

Now list the files sorted by modification time, reversed so that the most recently created files are printed at the bottom.

ls -ltrh

If ngsLCA has finished its run you will have a .lca.gz, .bdamage.gz, .rlens.gz and .stat.gz file. The first two are the per-read taxonomic classification and the nucleotide mismatch matrices for each of the references. While the lca.gz is the per-read classification, the other files are data outputs that are used downstream by the metaDMG dfit and aggregate options to produce a merged summary of damage, gc-content, read lengths and the like. So let's look at the log first.

more PRI-TJPGK-CATN-96-98.log 

If there were issues with finding accession numbers from the reference genomes, they would be printed as errors here, and the affected reads and their alignments would be ignored.

Now let's have a look at the taxonomic profile for the sample by zcat'ing the lca output.

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | head -40

This output contains the per-read classifications; each line is a unique read with its own classification. You can have a look at the .lca file format here. From here, simple text manipulation will allow you to parse the profile and plot it in R, or however you usually visualize your data.

Try cutting out the species names

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | cut -f 8 -d: | head -40

Try cutting out the taxonomic names of the lowest taxonomic level each read has been classified to, sort them, and count how many reads were classified to each.

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | tail -n +3 | cut -f 12 -d: | sort | uniq -c

Do you know any of these? Let's check the willow family "Salicaceae", using grep.

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | grep 'Salicaceae' | head -10

Note that the reads here are classified at different taxonomic levels. We can also grep out the metazoans, i.e. the animals.

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | grep 'Metazoa' | cut -f 14 -d: | sort | uniq -c

That should look something like this:

zcat PRI-TJPGK-CATN-26-28.filtered.lca.gz | grep 'Metazoa' | cut -f 14 -d: | sort | uniq -c
   6756 "c__Mammalia"
    119 "f__Elephantidae"
   2327 "f__Hominidae"
      3 "g__Loxodonta"
      3 "k__Metazoa"
    906 "l__Opisthokonta"
    333 "o__Proboscidea"
    556 "p__Chordata"

What does this tell you? Do you know these taxonomic levels? If not, perhaps try to google them. You can also use other resources like gbif.org to check plant and animal geographical distributions.

Now we don't know whether all these taxa are ancient or not, only that the reads are unique to the given level within the similarity thresholds set. But doing the mapDamage exercise for each of these 505 species in this sample (and that is a low count compared to what you can classify in a given sample) would take a lot of time: finding, downloading and indexing each genome and mapping the uniquely classified reads to it takes a looooooooong time. And then you would need to do this for all your samples, just to filter away taxa that show no damage, no DNA fragmentation, or have too few reads for ancient DNA authentication.

Plotting general statistics on the nature of the data from the lca output in R.

Activate your R environment. OBS! If you want to be able to view the output you should be using the IDE and not just the terminal.

conda activate r 

First we create a new file and add a header to it.

echo "ID,Readlength,Count" > long_table.csv

Then we can loop over the lca output and plot the readlength distribution using ggplot2.

for file in *lca.gz
do
    # Extract the ID from the input filename
    id=$(basename "$file" .filtered.lca.gz)

    # Count reads per read length: skip the lca header lines, count unique lengths, and append to long_table.csv
    zcat "$file" | tail -n +3 | cut -f 3 | sort | uniq -c | sed 1d | awk -v id="$id" '{print id "," $2 "," $1}' >> long_table.csv

    # Generate a scatterplot of the data using ggplot2
    Rscript -e "library(ggplot2); data <- read.csv('long_table.csv', header=TRUE, col.names=c('ID', 'Readlength', 'Count')); data <- na.omit(data); myplot <- ggplot(data, aes(x=Readlength, y=Count)) + geom_point() + facet_wrap(~ID, ncol = 2, scales = 'free_y')
ggsave('readlength_distributionTotal.pdf', myplot, width = 20, height = 20)"
done 

This can also be done per superkingdom, or whichever taxonomic level you want (OBS: this will take 5 minutes or so).

echo "ID,Category,Readlength,Count" > long_table2.csv

for file in *lca.gz
do
    # Extract the ID from the input filename
    id=$(basename "$file" .filtered.lca.gz)

    # Count reads per read length for each major group; note the single > so each
    # per-group table is overwritten for every sample instead of growing across iterations
    zcat "$file" | grep -e 'Archaea' | cut -f 3 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Archaea" "," $2 "," $1 }' > Archaea_long_table.csv
    zcat "$file" | grep -e 'Bacteria' -e 'bacteria' | cut -f 3 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Bacteria" "," $2 "," $1 }' > Bacteria_long_table.csv
    zcat "$file" | grep -e 'Viridiplantae' | cut -f 3 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Viridiplantae" "," $2 "," $1 }' > Viridiplantae_long_table.csv
    zcat "$file" | grep -e 'Virus' | cut -f 3 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Virus" "," $2 "," $1 }' > Virus_long_table.csv
    zcat "$file" | grep -e 'Metazoa' | cut -f 3 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Metazoa" "," $2 "," $1 }' > Metazoa_long_table.csv
    # Append this sample's per-group counts to the combined long table
    cat Archaea_long_table.csv Bacteria_long_table.csv Viridiplantae_long_table.csv Virus_long_table.csv Metazoa_long_table.csv >> long_table2.csv

    # Generate a stacked area plot of the combined data using ggplot2
    Rscript -e "library(ggplot2);library(viridisLite); data <- read.csv('long_table2.csv'); plot <- ggplot(data, aes(x = Readlength, y = Count, fill = Category)) + geom_area(alpha = 0.8) + scale_fill_viridis_d() + facet_wrap(~ID, scales = 'free_y', ncol = 2) + labs(x = 'Read Length', y = 'Count', title = 'Read Length Distribution by Category') + theme(plot.title = element_text(hjust = 0.5)); ggsave('readlength_distributionPerKingdom.pdf', plot, width = 13, height = 20)"
done

metaDMG

Let's make a merged output we can play with in R. OBS: remember to edit the header in the course_data.tsv file.

An Rscript can be found here as a suggestion ~/course/data/day2/scripts/aeGonmics_Euks20024.R

for file in  *.bdamage.gz.stat.gz ;  do echo "$file"; zcat "$file" | awk -v filename="$file" '{print filename "\t" $0}' >> course_data2.tsv ; done
grep 'taxid' course_data2.tsv -v > data_course_data.tsv
head -1 course_data2.tsv > header.tmp
cat header.tmp data_course_data.tsv > course_data.tsv
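As a starting point for your own plots, a minimal sketch along these lines could be used once the header is fixed (the column names nreads and A are assumptions based on typical metaDMG aggregate output; replace them with the actual names in your course_data.tsv, or use the suggested Rscript above):

Rscript -e "library(ggplot2); dmg <- read.delim('course_data.tsv', header=TRUE); p <- ggplot(dmg, aes(x=nreads, y=A)) + geom_point(alpha=0.5) + scale_x_log10() + labs(x='Number of reads (log10)', y='Damage amplitude (A)'); ggsave('metadmg_damage_vs_reads.pdf', p, width=8, height=6)"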