Skip to content
Mikkel Winther Pedersen edited this page Oct 4, 2023 · 15 revisions

Ancient DNA Authentication in Time series data

The output from metaDMG now counts some basic stats that you would want to extract for your project but also for potential future publications or like, such as the total number of reads classified to any taxonomic level, total number of reads classified within a superkingdom and like. This can be obtained from the files in the data/lca/ folder and using simple text parsing comments that you have already been using in the course first two days ('wc -l' or 'grep -c'). Please collect total number of reads classified per sample, total number of reads classified to plants per sample and total reads classified to animals per sample.

Please also ensure you have performed the last step in the tutorial from yesterday, e.g. converting the metaDMG output to a .csv file. Today we will explore a few examples of how to visualize such data in R.

First activate r environment and R,

conda activate r
R

Next, load the libraries needed and the data, if you are missing some libraries you can install these by install.packages("name.of.library")

library(tidyverse) 
library(reshape2)
library(vegan)
library(rioja)
library(readxl)
library(ggplot2)
library(dplyr)
library(gghighlight)

First, set working directory

setwd("/Users/npl206/Google Drive/My Drive/aeDNAphDcourse/Mikkel_tutorials/aeGenomics_profiling_damage/")

Import data

df <- read_csv("/Users/npl206/Google Drive/My Drive/aeDNAphDcourse/Mikkel_tutorials/aeGenomics_profiling_damage/metaDMGresults.csv")

Count number of columns in dataframe

ncol(df)

Print all sample names

unique(df$sample)

Import metadata

metaDATA <- read_excel("~/Google Drive/My Drive/aeDNAphDcourse/Mikkel_tutorials/metadata.xlsx", sheet = "metadata")

Merge the metadata by samplename

dt <- merge(df, metaDATA, by = "sample")

Now check that new columns have been added to the dt dataframe

ncol(df) < ncol(dt)

Print all different ages

unique(dt$YearsBP)

Print all column names of the table

colnames(dt)

Maximum amount of damage (filtered for)

DamMin2 = 0.00

MAP_Significance

MapSig2 = 0

Minimum reads for parsing taxa

MinRead2 = 100

Minimum mean read length

MinLength = 35

Subsetting the table by plant (Viridiplantae) genus using grepl and filter, parameters you need to set and possible add more?? Please think about how your group think it should be done. But first would probably be best to explore the data with less stringent filters, and plot these, then later on the basis of the data make decisions on how you want it filtered.

dt2 <- dt %>% filter(MAP_damage > DamMin2, N_reads >= MinRead2, mean_L > MinLength, MAP_significance  > MapSig2,  grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", sample))
pdf(file = "aeCourse.DNAdamageModelJitterPlot.pdf", width = 8, height = 4)
ggplot() +
  geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = N_reads), alpha =0.5) +
  gghighlight(N_reads > 500) +
  xlab("Years BP") +
  ylab("DNA damage") +
  labs(color = "Values for taxa with \n>500 reads", size = "Number of reads")
dev.off()

Plot plant taxa, highlight taxa with more than 500 reads and add the min, max and median. save as pdf

pdf(file = "aeCourse.DNAdamageLRJitterPlot.pdf", width = 8, height = 4)
ggplot() +
  geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = MAP_significance), alpha =0.5) +
  gghighlight(N_reads > 500) +
  xlab("Years BP")+
  ylab("DNA damage") +
  labs(color = "Values for taxa with \n>500 reads", size = "Significance \nfor Taxa with >500 reads")
dev.off()

Create filtered table for DNA damage model

filtered_data <- dt2 %>% filter(N_reads >= 500)

Now you should make decisions on how the data should be filtered. You can use the

MapSig3 = 3
MinRead3 = 100
MinLength3 = 35

Subsetting the table using grepl and filter, parameters you need to set and possible add more?

filtered_data_metazoan <- dt %>% filter(N_reads >= 10, mean_L > MinLength3, MAP_significance  > MapSig3,  grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unfiltered_data_metazoan <- dt %>% filter(N_reads >= 2, mean_L > MinLength3, MAP_significance  > 1,  grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
filtered_data_viridiplantae <- filtered_data %>% filter(N_reads >= 100, mean_L > MinLength3, MAP_significance  > LR3,  grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unique(filtered_data_viridiplantae$YearsBP)

You can also make a smaller table with only values of your choice example below.

select(filtered_data_viridiplantae, tax_name, MAP_damage, MAP_significance, N_reads, YearsBP)

Count number of unique plant taxa

unique(filtered_data_viridiplantae$tax_name)

Make wide table for downstream plot and data wrangling of the plants.

data_wide_plants <- dcast(filtered_data_viridiplantae, tax_name ~ YearsBP, value.var="N_reads", fun.aggregate = sum)
n <- ncol(data_wide_plants)
b2 <- data_wide_plants[,2:n]
rownames(b2) <- data_wide_plants$tax_name
b2[is.na(b2)] <- 0 #set all NAs as zeros

Prints sum of samples and taxa and depths names

colSums(b2)
rowSums(b2)
colnames(b2)

Test, if this one fails it might have text in the number of reads coloumn

b2[is.na(b2)]=0

Create percentage table calculated the number of columns and subtracts the sum of all reads coloumn, remove the -1 and you will plot the Nreads column on later plots

i=ncol(b2)
b3=as.matrix(b2[,seq(1,i)])  # change number of coloumns to the total of mydata

b4 <- prop.table(data.matrix(b3), margin=2)*100 # makes proportion table, needs 2 margins e.g. header and 1st row names
colSums(prop.table(b4, margin=2)*100) # prints sum of column, should give 100 for each

## Next we will transpose the table, for plotting it as a strat.plot
b5 <- t(b4)
## and set the variable z to be the years BP which is now row headers.
z <- as.numeric(rownames(b5)) # depth/depth

# import library vegan, to perform cluster analysis
library("vegan")

pdf(file = "aeCourse.Stratplot_Plants_area.pdf", width = 15, height = 5)
strat.plot(b5, y.rev=TRUE, plot.line=TRUE, plot.poly=TRUE, plot.bar=FALSE, lwd.bar=10, sep.bar=TRUE, scale.percent=TRUE, xSpace=0.01, x.pc.lab=TRUE, x.pc.omit0=TRUE, srt.xlabel=45, las=2, exag=TRUE, exag.mult=5, ylabel = "years BP",  yvar = z)
dev.off()

Now we will convert the wide table into a long table format to plot it with ggplot

y <- ncol(b5)
b6 <- melt(b5[,1:y])
sapply(b6, class)
colnames(b6) <- c("YearsBP","Taxa", "percentage")

p1 <- ggplot(b6, aes(y=Taxa, x=YearsBP, fill=percentage)) +   geom_tile(colour="lightgrey") +
  theme_minimal() + scale_fill_gradient(low="white", high="darkgreen") + scale_y_discrete(limits=rev)
p1 + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) + ggtitle("percentage of taxa plotted as heatmap") +
  xlab("YearsBP") + ylab("YearsBP") + labs(fill = "percentage %")

readlentgths per taxa

Phylogenetic placement and Pathphynder analysis

Tools

phyton (v. 3.11.3) - https://www.python.org/ biopython - https://biopython.org/wiki/AlignIO bowtie2(v. 2.3.2) - https://github.com/BenLangmead/bowtie2 samtools - https://github.com/samtools/samtools mafft (v7.427) - https://mafft.cbrc.jp/alignment/software raxml-ng (v. 0.8.1 BETA) - https://github.com/amkozlov/raxml-ng bamcov (v. 0.1) - https://github.com/fbreitwieser/bamcov angsd (v. 0.928) - https://github.com/ANGSD/angsd BEAST (v.1.10.4) - https://beast.community/ or BEAST2 - https://www.beast2.org/ bwa (v. 0.7.17-r1188) - https://github.com/lh3/bwa pathPhynder - https://github.com/ruidlpm/pathPhynder snp-sites - https://github.com/sanger-pathogens/snp-sites phynder - https://github.com/ruidlpm/pathPhynder - https://github.com/richarddurbin/phynder FIGTREE (v1.4.4) - http://tree.bio.ed.ac.uk/software/figtree/

1. Part A. Extraction of reads from lca file

Print sample list based on name files

ls *lca.txt > sample.list

Taxa list file

nano name_genus
> save it as > 'taxa.list'

Create files based on SAMPLE.list and Taxa.list

while read -r line
do
arr=($line)
lib=${arr[0]}
echo $lib
cat taxa.list | parallel -j20 "grep {} $lib | cut -f1,2,3,4,5,6,7 -d: > $lib.{}.readID.txt"
done < sample.list

Remove lines in file were not found name_genus

wc -l *.readID.txt| awk '$1 == 0' | awk '{print $2}' > rm.list
cat rm.list | parallel -j20 "rm {}"

Create sum-up file total sequences found per sample

wc -l *.readID.txt| paste > tot_genus_sequences.txt

Create fastq from readIDs

for infile in *readID.txt
do
bname=$(basename $infile)
bname1=$(echo $bname | sed 's/.readID.txt*/.fq/')   
bname2=$(echo $bname | cut -f1 -d. )
echo $bname1
echo $bname2
seqtk subseq path_to_files_fq/adap2_kmer2_*$bname2*.fq.pp.rmdup.fq $bname > $bname1 &
done

1. Part B: Prepare fasta reference files

Download reference mitochondrial genomes from NCBI or directly on NCBI Nucleotide database

Send to > File > Fasta > Create File

The samples used here were based on paper Taylor et al. 2021:Taylor, William TT, et al. "Evidence for early dispersal of domestic sheep into Central Asia." Nature Human Behaviour 5.9 (2021): 1169-1179.)

See ReferenceTable1.xlsx

Be sure to have removed tab and , " , ' , : from sequence

sed "s/:/_/g" in.fa > out.fa
sed "s/ /_/g" in.fa > out.fa
sed "s/;/_/g" in.fa > out.fa
sed "s/__/_/g" in.fa > out.fa
sed "s/(/./g" in.fa > out.fa
sed "s/)/./g" in.fa > out.fa

Concatenate reference fasta files for each Haplogroup

cat *.fasta > in_to_align.fa

Alignment Haplorgroup references

mafft --thread 40 --leavegappyregion Ovis_aries_musimon2.fasta > Ovis_aries_musimon2_aligned.fa

Create Consensus sequence for each Haplogroup

in Geneious (if available): Tools > Generate Consensus Sequence > majority base calling

Or use Python BioAlign's consensus function (python) - biophyton needs to be installed: conda install -c conda-forge biopython

python /path_to_Script/get_consensus.txt Reference_aligned.fa Reference_output_Consensus.fa  

Header consensus needs to be renamed

for file in *.fa
do
bn=`basename $file .fa`
awk '/^>/ {gsub(/.fa(sta)?$/,"",FILENAME);printf(">%s\n",FILENAME);next;} {print}' $file > ${bn}_renamed.fa ; rm $file
done

Concatenate all consensus sequences

cat *Consensus.fa > MSA_to_align.fa

Alignment references

mafft --thread 40 --leavegappyregion MSA_to_align.fa  > MSA_out_aligned.fa

Build a reference Tree (check of a Maximum Likelihood Tree)

raxml-ng --msa MSA_out_aligned.fa --model GTR+G --prefix output_name --threads 2

If any warnings, run the '*reduced.phy' files to get a clean Tree.

2. Mapping to Consensus references of Ovis Haplogroups

Build indexes for database

for file in *.fa
do
bname=$(basename $file)
bowtie2-build -f $file DB_$bname
done

Mapping in for loop

for file in $(pwd)/*.fq
do
bname=$(basename $file)
echo $bname
bname2=$(echo $bname | cut -f1-3 -d_)
echo $bname2
bname3=$(echo $bname | sed 's/.fq.pp.rmdup.fq*/_holi/')
basepath=$(pwd)/
basefolder=$basepath
echo $basepath
echo $bname3
mkdir $basepath$bname3
cd $basepath$bname3
done

mapping to the different databases

for DB in /path_to_DB/name_DB
do
echo Mapping adap2_kmer2_$bname.pp.rmdup.fq against $DB
bowtie2 --threads 20 -x $DB -U $file --no-unal | samtools view -bS - > $bname2.$(basename $DB).bam
done

consider different settings for bowtie2

for DB in /path_to_DB/name_DB
do
echo Mapping adap2_kmer2_$bname.pp.rmdup.fq against $DB
bowtie2 --threads 20 -x $DB -U $file --no-unal | samtools view -bS - > $bname2.$(basename $DB).bam
done

Sorting bam files

for bam in *.bam
do
echo Sorting bam files
samtools sort -O BAM -o sort.$bam $bam
done

getting coverage stats

for sort in sort.*.bam
do
echo Bamcov $sort
/path_to_bamcov/bamcov -m -w0 $sort | paste > bamcov_hist_$sort.txt
/path_to_bamcov/bamcov $sort | paste > bamcov_table_$sort.txt
done

cd $basepath
done

Important to assess the coverage of the mapped reads to different mitogenomes before proceeding with the tree.

4. Prepare alignement files for tree

move all sort.bam in a folder

mkdir new_folder

for file in *holi/sort.*.bam
do
mv $file new_folder
done

consensus of reads bam > fast

for file in sort*.bam
do
angsd -doFasta 2 -doCounts 1 -i $file -out $file.consensus.fa
done

gunzip

for file in *.consensus.fa.fa.gz
do
bname=$(basename $file)
echo $bname
gunzip $file
done

Delete file size 0

for file in *.consensus.fa.fa
do
bname=$(basename $file)
echo $bname
find $file -size 0 -delete
done

rename headers

for file in *.fa.fa
do
bn=`basename $file .fa.fa`
awk '/^>/ {gsub(/.fa(sta)?$/,"",FILENAME);printf(">%s\n",FILENAME);next;} {print}' $file > ${bn}_renamed.fa ; rm $file
done

5. (Optional) Building Maximum Likelihood Tree

Cconcatenate sequences

cat *renamed.fa MSA_not_aligned.fa > Allref_mycons.fa

alignment

mafft --thread 40 --leavegappyregion Allref_mycons.fa > Allref_mycons_aligned.fa

tree

raxml-ng --msa output_name.raxml.reduced.phy --model GTR+G --prefix output_nameRED --threads 2

rerun using reduced.phy file if optimised alignment by raxml is suggested

6. Building Tree in BEAST: this can take some time, tree file should be provided already.

We ran a tree in Beast based on the msa file for references and for references with sample files. Parameters used in BEAST (v.1.10.4) with 20,000,000 replicates, applying the HKY model and a Coalescent Constant Population as prior. BEAUTI > import alignment, set parameters, save file as Aln_mafft_taxa_references.xml

run BEAST or BEAST2 on server:

beast -threads n Aln_mafft_taxa_references.xml

Check log file TRACER > open .log file and check burnin, if not okay, increase replicates

Saving Tree

TreeAnnotator → open .trees file and save it as a Maximum Credibility Tree with 10% Burnin when saving/naming the file you have to add the file extension .nexus

If using BEAST2 the tree will be saved with .tree extension that can be read by FIGTREE.

Save Newick Tree

FIGTREE > Open tree.nexus > export tree as Newick > save with .nwk #Visualise node labels/branch labels as posterior values

Check .nwk file

Check with text editor if tree.nwk has any ‘ symbols, if so, remove them

7. Pathphynder

to call those SNPs in a given dataset of ancient samples and find the best path and branch where these can be mapped in the tree.

create VCF with snp-sites from multiple sequence alignment (MSA) file

snp-sites -v -c -o Aln_mafft_taxa_references.vcf Aln_mafft_taxa_references.fa

fix vcf file '/home/mxd270/R/x86_64-pc-linux-gnu-library/4.3' installed stringr package

Rscript /path_to_Rscript/fix_vcf.R Aln_mafft_taxa_references.vcf Aln_mafft_taxa_references_output.vcf

fix consensus naming problem in vcf file (replace 1’s with “consensus”)

awk '{ if ($1 == "1") $1="consensus";}1' Aln_mafft_taxa_references_output.vcf | sed 's/ /\t/g' > Aln_mafft_taxa_references_fixed_output.vcf

create consensus for MSA.fa and index it###

python /path_To_Script/get_consensus.txt Aln_mafft_taxa_references.fa Cons_Aln_mafft_taxa_references.fa

bwa index Cons_Aln_mafft_taxa_references.fa

Create directory in pathPhynder_analysis folder

mkdir map_to_cons

bwa mapping of Ovis reads to consensus

bwa aln -l 1024 -n 0.001 -t 10 /path_to_folder/Consensus_Mafft_All_references.fa /path_to_folder/library.fq
| bwa samse /path_to_folder/Consensus_Mafft_All_references.fa  - /path_to_folder/library.fq
| samtools view -F 4 -q 25 -@ 10 -uS - | samtools sort -@ 10 -o library.sort.bam

Count number of mapped reads

samtools view -c file.bam

8. Run pathPhynder

Create directory in pathPhynder_analysis folder

mkdir pathphynder_results

Assign informative SNPs to tree branches

phynder -B -o /path_to_folder/branches.snp /path_to_folder/Mafft_All_reference.nwk /path_to_folder/Mafft_All_references_fixed_consensus.vcf

Prepare data - this will output a bed file for calling variants and tables for phylogenetic placement

pathPhynder -s prepare -i /path_to_folder/Mafft_All_reference.nwk -p taxa_pathphynder_tree -f /path_to_folder/branches.snp -r /path_to_folder/Consensus_Mafft_All_references.fa

PathPhynder command1: ### only transversions

pathPhynder -s all -t 100 -m transversions -i /path_to_folder/Mafft_All_references.nwk -p /path_to_folder/taxa_pathphynder_tree -l /path_to_folder/bamlist.txt -r /path_to_folder/Consensus_Mafft_All_references.fa

PathPhynder command2: ###transitions and transvertions

pathPhynder -s all -t 100 -i /path_to_folder/Mafft_All_references.nwk -p /path_to_folder/taxa_pathphynder_tree -l /path_to_folder/bamlist.txt -r /path_to_folder/Consensus_Mafft_All_references.fa

Visualise pdf files with the results.