-
Notifications
You must be signed in to change notification settings - Fork 4
Day 2
activate conda environment and go to mapping folder
conda activate day1
cd ~/course/wdir/mapping
We can now explore the alignment files with samtools. Remember you can use the view function in explore the bam/sam file. try to specify the -h and -H options. Notice that samtools log commands at the bottom of the -H (print header only) option. Also, explore a bit the bam file format, and see if you can recognize any of the information in there? You can also try out the -c option which counts the number of alignments in your bamfile.
samtools view PRI-TJPGK-CATN-96-98.bam | head -40
We can now parse all our alignments for each read to a taxonomic classifier in this case we use ngsLCA. But before this can be done ngsLCA assumes that the bam/sam files are sorted by readID. This can be done with the samtools suite more specifically the samtools sort command with the option -n.
samtools sort -n PRI-TJPGK-CATN-96-98.bam -@ 5 > PRI-TJPGK-CATN-96-98.sort.bam
The ngsLCA (next generation sequence Lowest Common Ancestor algorithm) toolkit includes two modules, ngsLCA main program and ngsLCA R package. The ngsLCA main program provides a fast and flexible taxonomic classification of DNA reads aligned to reference databases containing multiple organisms. The classification builds upon the NCBI taxonomy and performs a naïve lowest common ancestor assignment for reads with multiple alignments against different references. The command line tool outputs a text file in "lca" format. Normally we would download the NCBI taxonomy dump files simultaneously with the databases e.g. RefSeq https://www.ncbi.nlm.nih.gov/refseq/ and nt to ensure overlap between versions. You can explore this after the course through guide https://github.com/miwipe/ngsLCA. In this tutorial we will use a subset of this database combined with a subset of microbial genomes from GTDB https://gtdb.ecogenomic.org/ that has been made compatible with the ncbi taxonomy. It is possible to use a set of custom reference genomes, but one needs to ensure that the accession number in the fasta header of the reference is linking to a taxid and so forth. and that all confines to the NCBI taxonomy format.
conda activate day2
Now let's call on ngsLCA
ngsLCA
this should results in ngsLCA print the version number and the command with options. Like:
-> ngslca version: 0.9 (htslib: 1.17) build(May 18 2023 23:42:00) -> ./ngsLCA -names -nodes -acc2tax [-editdist[min/max] -simscore[low/high] -minmapq -discard] -bam
OBS., Please note that the similarity can be set as an edit distance [-editdist[min/max]], i.e., number of mismatches between the read to reference genome, or as a similarity distance [-simscore[low/high]], i.e., percentage of mismatches between the read to reference genome. Edit distance can be a number between 0-10, while the similarity score is a number between 0-1.
Please discuss in your group what could be a good set of options and note down why you have chosen these. Hint: the mapping quality of the
The ngsLCA algorithm has been implemented and is now continued inside another program we call metaDMG, so lets go to the environment
conda activate metaDMG
metaDMG-cpp lca
Now this should print near to identical command options as the ngsLCA command. If this works let's set metaDMG lca to run.
metaDMG-cpp lca -bam PRI-TJPGK-CATN-96-98.sort.bam -names ~/course/data/shared/mapping/taxonomy/names.dmp -nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp -acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -weighttype 1 -fix_ncbi 0 -out PRI-TJPGK-CATN-96-98
Now make a list sorted by time and reversed to print the list of latest files created in the bottom.
ls -ltrh
If ngsCLA has finished its run you will have a .lca and .log (and .bin which is a binary look up file) the two first files are the per-read taxonomic classification and the log which outputs error messages if any. The PRI-TJPGK-CATN-96-98.bdamage.gz and the PRI-TJPGK-CATN-96-98.stat are files created for the DNA damage estimation, and are not part of the ngsLCA output, but are generated during the metaDMG implementation of ngsLCA as we were going through the bam file anyways. So let's look at the log first.
more PRI-TJPGK-CATN-96-98.log
If there were issues with finding accession numbers, from the reference genomes, that would be printed as an error here. And the read and its alignments will be ignored.
Now lets have a look at your taxonomic profile for the sample, but zcat'ing the lca output.
zcat PRI-TJPGK-CATN-96-98.lca.gz | head -40
This output contains per-read classification and each line is a unique read with its own classification. You can have a look at the .lca file format here. Now this is simple text manipulation that will allow you to parse the profile and to further plot this in R or however you usually visualize your data.
Try cutting out the species names
zcat PRI-TJPGK-CATN-96-98.lca.gz | cut -f 14 -d: | head -40
Try cutting out the taxonomic names of lowest taxonomic level each read has been classified to, sort them and count how many reads was classified to each.
zcat PRI-TJPGK-CATN-96-98.lca.gz | cut -f 14 -d: | sort | uniq -c
Do you know any of these? Let's check the rose family "Rosaceae", using grep.
zcat PRI-TJPGK-CATN-96-98.lca.gz | grep 'Rosaceae' | head -10
Note, that the reads here are classified at different taxonomic levels. We can also grep out the Ursidae family.
zcat PRI-TJPGK-CATN-96-98.lca.gz | grep 'Ursidae' | cut -f 14 -d: | sort | uniq -c
Do you know this family? If not perhaps try and google it? You can also use other resources like gbif.org.
Now we don't know if all these taxa are ancient or not, just that the reads are unique to the given level, within the similarity thresholds set. But to do the mapDamage exercise for each of these 505 species in this sample (and that is a low count compared to what you can classify in a given sample) would take a lot of time, first finding, downloading, indexing, mapping the uniquely classified reads to each corresponding genomes will take a looooooooong time. And then you will need to do this for all your samples, just to filter away taxa that show now damage, DNA fragmentation or have too few reads for this ancient DNA authentication.
Activate your R environment, OBS! if you want to be able to view the output you should be using the IDE and not the terminal only.
conda activate r
First we create and add header to a new file
echo "ID,Readlength,Count" > long_table.csv
Then we can loop over the lca output and plot the readlength distribution using ggplot2.
for file in *lca.gz
do
# Extract the ID from the input filename
id=$(basename "$file" .lca.gz)
# Sort and count the unique numbers in the file, skip first line, and append to long_table.csv
zcat "$file" | cut -d':' -f9 | sort | uniq -c | sed 1d | awk -v id="$id" '{print id "," $2 "," $1}' >> long_table.csv
# Generate a scatterplot of the data using ggplot2
Rscript -e "library(ggplot2); data <- read.csv('long_table.csv'); myplot <- ggplot(data, aes(x=Readlength, y=Count)) + geom_point() + facet_wrap(~ID, ncol = 2, scales = 'free_y'); ggsave('readlength_distributionTotal.pdf', myplot, width = 6, height = 4)"
done
This can also be done per superkingdom or which taxonomic level you want. (OBS will take 5 minutes or so)
echo "ID,Category,Readlength,Count" > long_table2.csv
for file in *lca.gz
do
# Extract the ID from the input filename
id=$(basename "$file" .lca.gz)
# Sort and count the unique numbers in the file, skip first line, and append to long_table.csv
zcat "$file" | grep -e 'Archaea' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Archaea" "," $2 "," $1 }' >> Archaea_long_table.csv
zcat "$file" | grep -e 'Bacteria' -e 'bacteria' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Bacteria" "," $2 "," $1 }' >> Bacteria_long_table.csv
zcat "$file" | grep -e 'Viridiplantae' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Viridiplantae" "," $2 "," $1 }' >> Viridiplantae_long_table.csv
zcat "$file" | grep -e 'Virus' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Virus" "," $2 "," $1 }' >> Virus_long_table.csv
zcat "$file" | grep -e 'Metazoa' | cut -d':' -f9 | sort -n | uniq -c | sed 1d | awk -v id="$id" '{print id "," "Metazoa" "," $2 "," $1 }' >> Metazoa_long_table.csv
# Create a long table of all long tables
cat Archaea_long_table.csv Bacteria_long_table.csv Viridiplantae_long_table.csv Virus_long_table.csv Metazoa_long_table.csv >> long_table2.csv
# Generate a scatterplot of the data using ggplot2
Rscript -e "library(ggplot2);library(viridisLite); data <- read.csv('long_table2.csv'); plot <- ggplot(data, aes(x = Readlength, y = Count, fill = Category)) + geom_area(alpha = 0.8) + scale_fill_viridis_d() + facet_wrap(~ID, scales = 'free_y', ncol = 2) + labs(x = 'Read Length', y = 'Count', title = 'Read Length Distribution by Category') + theme(plot.title = element_text(hjust = 0.5)); ggsave('readlength_distributionPerKingdom.pdf', plot, width = 5, height = 7)"
done
Now we will compute DNA damage and statistics using metaDMG.
conda activate metaDMG
Create configuration file = config.yaml which defines the samples and taxonomy used as well as the LCA and damage calculation settings (see https://github.com/miwipe/ngsLCA for details) OBS input files can be in .sam, .sam.gz and bam file formats (they are the same just different compressions). The taxonomy should be identical to the version of database you use, it can be NCBI or custom.
metaDMG config *.sort.bam --names ~/course/data/shared/mapping/taxonomy/names.dmp --nodes ~/course/data/shared/mapping/taxonomy/nodes.dmp --acc2tax ~/course/data/shared/mapping/taxonomy/acc2taxid.map.gz -m /usr/local/bin/metaDMG-cpp
Edit the config file according to wishes, study and data. set custom database = true, set number of threads to 4.
vim config.yaml
Then compute the statistics
metaDMG compute config.yaml
And now it is time to explore the data, first using the GUI
metaDMG dashboard config.yaml
You should now be prompted to open a new browser window, press ok to the TRUST message, and open. Then we go for an interactive exploration of the dataset.
Once done browsing the GUI, press control + c to exit
Now when datasets grow to a larger size and you want to handle it in R or like you can convert the output in the /data/results/ folder to a merged csv file.
metaDMG convert --output path2/results.csv --add-fit-predictions