-
Notifications
You must be signed in to change notification settings - Fork 1
Day 3
The output from metaDMG now counts some basic stats that you would want to extract for your project but also for potential future publications or like, such as the total number of reads classified to any taxonomic level, total number of reads classified within a superkingdom and like. This can be obtained from the files in the data/lca/ folder and using simple text parsing comments that you have already been using in the course first two days ('wc -l' or 'grep -c'). Please collect total number of reads classified per sample, total number of reads classified to plants per sample and total reads classified to animals per sample.
Please also ensure you have performed the last step in the tutorial from yesterday, e.g. converting the metaDMG output to a .csv file. Today we will explore a few examples of how to visualize such data in R.
First activate r environment and R,
conda activate r
R
Next, load the libraries needed and the data, if you are missing some libraries you can install these by install.packages("name.of.library")
library(tidyverse)
library(reshape2)
library(vegan)
library(rioja)
library(ggplot2)
library(dplyr)
library(gghighlight)
First, set working directory
setwd("~/course/wdir/mapping/plots/")
Import data
df <- read_csv("~/course/wdir/mapping/metaDMGresults.csv")
Count number of columns in dataframe, just to familiarize your self with it.
ncol(df)
Print all sample names
unique(df$sample)
Import metadata
metaDATA <- read.delim("~/course/data/shared/metadata/metadata.tsv")
Replace header "sample_name" with "sample"
colnames(metaDATA)[colnames(metaDATA) == "sample_name"] <- "sample"
colnames(metaDATA)[colnames(metaDATA) == "years_bp"] <- "YearsBP"
Merge the metadata and dataframe by sample
dt <- merge(df, metaDATA, by = "sample")
Now check that new columns have been added to the dt dataframe
ncol(df) < ncol(dt)
Print all different ages
unique(dt$YearsBP)
Print all column names of the table
colnames(dt)
Maximum amount of damage (filtered for)
DamMin2 = 0.00
MAP_Significance
MapSig2 = 0
Minimum reads for parsing taxa
MinRead2 = 100
Minimum mean read length
MinLength = 35
Subsetting the table by plant (Viridiplantae) genus using grepl and filter, parameters you need to set and possible add more?? Please think about how your group think it should be done. But first would probably be best to explore the data with less stringent filters, and plot these, then later on the basis of the data make decisions on how you want it filtered.
dt2 <- dt %>% filter(MAP_damage > DamMin2, N_reads >= MinRead2, mean_L > MinLength, MAP_significance > MapSig2, grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", sample))
Now plot your data, you can add other variables in the gghighlight to illustrate different cut-offs, or other types of values etc.
pdf(file = "aeCourse.DNAdamageModelJitterPlot.pdf", width = 8, height = 4)
ggplot() +
geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = N_reads), alpha =0.5) +
gghighlight(N_reads > 500) +
xlab("Years BP") +
ylab("DNA damage") +
labs(color = "Values for taxa with \n>500 reads", size = "Number of reads")
dev.off()
Plot plant taxa, highlight taxa with more than 500 reads and add the min, max and median. save as pdf
pdf(file = "aeCourse.DNAdamageLRJitterPlot.pdf", width = 8, height = 4)
ggplot() +
geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = MAP_significance), alpha =0.5) +
gghighlight(N_reads > 500) +
xlab("Years BP")+
ylab("DNA damage") +
labs(color = "Values for taxa with \n>500 reads", size = "Significance \nfor Taxa with >500 reads")
dev.off()
Create filtered table for DNA damage model
filtered_data <- dt2 %>% filter(N_reads >= 500)
Now you should make decisions on how the data should be filtered. You can use the
MapSig3 = 3
MinRead3 = 100
MinLength3 = 35
Subsetting the table using grepl and filter, parameters you need to set and possible add more?
filtered_data_metazoan <- dt %>% filter(N_reads >= 10, mean_L > MinLength3, MAP_significance > MapSig3, grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unfiltered_data_metazoan <- dt %>% filter(N_reads >= 2, mean_L > MinLength3, MAP_significance > 1, grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
filtered_data_viridiplantae <- filtered_data %>% filter(N_reads >= 100, mean_L > MinLength3, MAP_significance > LR3, grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unique(filtered_data_viridiplantae$YearsBP)
You can also make a smaller table with only values of your choice example below.
select(filtered_data_viridiplantae, tax_name, MAP_damage, MAP_significance, N_reads, YearsBP)
Count number of unique plant taxa
unique(filtered_data_viridiplantae$tax_name)
Make wide table for downstream plot and data wrangling of the plants.
data_wide_plants <- dcast(filtered_data_viridiplantae, tax_name ~ YearsBP, value.var="N_reads", fun.aggregate = sum)
n <- ncol(data_wide_plants)
b2 <- data_wide_plants[,2:n]
rownames(b2) <- data_wide_plants$tax_name
b2[is.na(b2)] <- 0 #set all NAs as zeros
Prints sum of samples and taxa and depths names
colSums(b2)
rowSums(b2)
colnames(b2)
Test, if this one fails it might have text in the number of reads coloumn
b2[is.na(b2)]=0
Create percentage table, by taking the number of columns and divides these with sum of all reads in the column, remove the -1 and you will plot the Nreads column on later plots
i=ncol(b2)
b3=as.matrix(b2[,seq(1,i)])
b4 <- prop.table(data.matrix(b3), margin=2)*100 # makes proportion table, needs 2 margins e.g. header and 1st row names
colSums(prop.table(b4, margin=2)*100) # prints sum of column, should give 100 for each
Next we will transpose the table, for plotting it as a strat.plot
b5 <- t(b4)
and set the variable z to be the years BP which is now row headers.
z <- as.numeric(rownames(b5)) # depth/depth
and plot it on a stratigraphic plot (typical pollen type plot)
pdf(file = "aeCourse.Stratplot_Plants_area.pdf", width = 15, height = 5)
strat.plot(b5, y.rev=TRUE, plot.line=TRUE, plot.poly=TRUE, plot.bar=FALSE, lwd.bar=10, sep.bar=TRUE, scale.percent=TRUE, xSpace=0.01, x.pc.lab=TRUE, x.pc.omit0=TRUE, srt.xlabel=45, las=2, exag=TRUE, exag.mult=5, ylabel = "years BP", yvar = z)
dev.off()
Now we will convert the wide table into a long table format to plot it with ggplot
y <- ncol(b5)
b6 <- melt(b5[,1:y])
sapply(b6, class)
colnames(b6) <- c("YearsBP","Taxa", "percentage")
p1 <- ggplot(b6, aes(y=Taxa, x=YearsBP, fill=percentage)) + geom_tile(colour="lightgrey") +
theme_minimal() + scale_fill_gradient(low="white", high="darkgreen") + scale_y_discrete(limits=rev)
p1 + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) + ggtitle("percentage of taxa plotted as heatmap") +
xlab("YearsBP") + ylab("YearsBP") + labs(fill = "percentage %")
These were just a few examples, it is possible to replace values and plot data in a wide variety of ways. Which you need and will use depends on your data, your study and the story you want to tell.
Euka - a tetrapodic and arthropodic taxa detection from modern and ancient environmental DNA using pangenomic reference graphs
If you are looking for other available options just run ~/course/data/vgan/bin/vgan euka for the manual.
Within the vgan folder you you can find a folder called bin where the vgan executable can be found. So, to run euka you can just use the following command:
~/course/data/vgan/bin/vgan euka -fq1 <(zcat ~/course/wdir/mapping/PRI-TJPGK-CATN-96-98.fq.gz.vs.fq.gz) -o PRI-TJPGK-CATN-96-98 -t 5
If you want to input multiple fastq files at once, please use a file descriptor otherwise euka will not recognize all the files. For example:
or
~/course/data/vgan/bin/vgan euka -fq1 <(zcat ~/course/wdir/mapping/*vs.fq.gz) -o all_samples -t 5
The goal of this part of the tutorial is to place a consensus mitogenome of the elephantid reads into a phylogeny. To do this, you will extract the elephantid reads, re-map them to a reference mitogenome, call a consensus mitogenome, and place the mitogenome into a phylogenetic context.
conda activate env2
mkdir ~/course/wdir/phylogenetic_placement
cd ~/course/wdir/phylogenetic_placement
Extract a list of elephantid read IDs from the lca file in the 'mapping' working directory. Update the name of the '.filtered.metaDMG.bam.lca.gz' file as appropriate.
zgrep 'Elephantidae' ~/course/wdir/mapping/PRI-TJPGK-CATN-26-28.filtered.metaDMG.bam.lca.gz | cut -f1 > PRI-TJPGK-CATN-26-28.elephantidae.readID.txt
check how many read IDs were extracted wc -l PRI-TJPGK-CATN-26-28.elephantidae.readID.txt
extract the elephantid-assigned reads in fastq format seqtk subseq ~/course/data/day1/fastq/PRI-TJPGK-CATN-26-28.fq.gz PRI-TJPGK-CATN-26-28.elephantidae.readID.txt | gzip > PRI-TJPGK-CATN-26-28.elephantidae.fq.gz
check how many reads were extracted. The count should be same as read ID count calculated above zcat PRI-TJPGK-CATN-26-28.elephantidae.fq.gz | awk '{s++}END{print s/4}'
retrieve the woolly mammoth reference mitochondrial genome from NCBI mkdir ~/course/wdir/reference_mitogenomes wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_007596.2&rettype=fasta" -O ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta
check that the woolly mammoth reference mitochondrial genome downloaded ok head ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta
build a bowtie2 reference of the woolly mammoth mitochondrial genome. Note that bowtie2 will output the index files to the current working directory, so we need to change directory, build the reference, and then change back cd ~/course/wdir/reference_mitogenomes bowtie2-build Mammuthus_primigenius_mtDNA.fasta Mammuthus_primigenius_mtDNA.fasta cd ~/course/wdir/phylogenetic_placement
check that the build worked. There should be six .bt2 files. ls ~/course/wdir/reference_mitogenomes
map the elephantid-assigned reads to the woolly mammoth mitochondrial genome. Here, we are using a map-quality cut-off of 20. bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 20 -bS - > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam
The standard output gives the bowtie2 alignment statistics. How many reads were not aligned with bowtie2?
check the count of reads in the final bam file using samtools samtools flagstat PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam | head -1
Why is there a discrepancy between the aligned read count from samtools and bowtie2?
first, we need to sort the bam file using samtools samtools sort PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam
for fun, we can calculate the mean depth of coverage for the bam file using samtools samtools depth -a PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam | cut -f3 | datamash mean 1
calling the consensus using angsd. As angsd may give a shared library load error (you can check), we can reinstall from conda. If doing this, hit 'y' when prompted. conda install bioconda::angsd angsd -dofasta 2 -docounts 1 -minmapq 20 -minq 25 -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam -out PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort gzip -d PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa.gz
Now edit the fasta header and rename it sample26_mamPri_mq20. We will use the nano text editor. Install and hit 'y' when prompted. conda install nano nano PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa
Make the edit by replacing '>NC_007596.2' with '>sample26_mamPri_mq20' on the first line. Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
Download the 45 elephantid mitogenomes wget "https://www.dropbox.com/scl/fi/5kit6qpxuvschmu3r4zy0/45_elephantidae.fa?rlkey=chidwctyt82yajn4cuh99qx20&dl=0" -O Elephantidae_45_mtDNA.fa
cat Elephantidae_45_mtDNA.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa > 45_elephantidae_sample26.fa
mafft --thread 5 45_elephantidae_sample26.fa > 45_elephantidae_sample26.mafft
Build a Maximum Likelihood phyloegentic tree with bootstrap support. First, we need to install raxml-ng. Hit 'y' when prompted during installation. The following command uses five threads for analytical speed, 200 bootstrap replicates for support, and sets the two American mastodon (Mammut americanum) sequences as outgroup. The --redo flag allows for file overwriting if the analysis needs to be re-run.
conda install raxml-ng raxml-ng --all --msa 45_elephantidae_sample26.mafft --model GTR+G --prefix EleMT --threads 5 --seed 2 --bs-trees 200 --outgroup "ENA|EF632344|EF632344.1.Mammut.americanum.mitochondrion.complete.genome.","ENA|KY364233|KY364233.1.Mammut.americanum.mitochondrion.complete.genome." --redo
The key output file is 'EleMT.raxml.support'. If we add the '.tre' extension, then this file can be opened by a variety of tree editors cp EleMT.raxml.support EleMT.raxml.support.tre
The EleMT.raxml.support.tre tree file can be downloaded and visualized on http://etetoolkit.org/treeview/ or with FigTree (https://github.com/rambaut/figtree/releases)
Optional additional exercise: what are the impacts of reference bias? Here we use a higher map quality cut-off and align to a different reference mitogenome (African elephant) to test the consistency of our results.
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000934.1&rettype=fasta" -O ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta head ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta cd ~/course/wdir/reference_mitogenomes bowtie2-build Loxodonta_africana_mtDNA.fasta Loxodonta_africana_mtDNA.fasta cd ~/course/wdir/phylogenetic_placement ls ~/course/wdir/reference_mitogenomes bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 30 -bS - > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 20 -bS - > PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 30 -bS - > PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam samtools flagstat PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam | head -1 samtools flagstat PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam | head -1 samtools flagstat PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam | head -1 samtools sort PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam samtools sort PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam > PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam samtools sort PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam > PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam samtools depth -a PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam | cut -f3 | datamash mean 1 samtools depth -a PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam | cut -f3 | datamash mean 1 samtools depth -a PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam | cut -f3 | datamash mean 1 angsd -dofasta 2 -docounts 1 -minmapq 30 -minq 25 -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam -out PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort gzip -d PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa.gz angsd -dofasta 2 -docounts 1 -minmapq 20 -minq 25 -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam -out PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort gzip -d PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa.gz angsd -dofasta 2 -docounts 1 -minmapq 30 -minq 25 -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam -out PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort gzip -d PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa.gz
nano PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa Make the edit by replacing '>NC_007596.2' with '>sample26_mamPri_mq30' on the first line. # Exit with 'Ctrl+X' followed by 'y' followed by 'enter'. nano PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa Make the edit by replacing '>NC_007596.2' with '>sample26_loxAfr_mq20' on the first line. # Exit with 'Ctrl+X' followed by 'y' followed by 'enter'. nano PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa Make the edit by replacing '>NC_007596.2' with '>sample26_loxAfr_mq30' on the first line. # Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
cat Elephantidae_45_mtDNA.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa > 45_elephantidae_sample26_v2.fa mafft --thread 5 45_elephantidae_sample26_v2.fa > 45_elephantidae_sample26_v2.mafft raxml-ng --all --msa 45_elephantidae_sample26_v2.mafft --model GTR+G --prefix EleMT_v2 --threads 5 --seed 2 --bs-trees 200 --outgroup "ENA|EF632344|EF632344.1.Mammut.americanum.mitochondrion.complete.genome.","ENA|KY364233|KY364233.1.Mammut.americanum.mitochondrion.complete.genome." --redo cp EleMT_v2.raxml.support EleMT_v2.raxml.support.tre
The EleMT_v2.raxml.support.tre tree file can be downloaded and visualized on http://etetoolkit.org/treeview/ or with FigTree (https://github.com/rambaut/figtree/releases)
Are there any differences in the mitogenome branch lengths and their phylogenetic positions? Why might there be these differences?
Unfortunately we couldn't get quicksand to run on the virtual machines, so I ran quicksand on the test samples, and then copied the output to $qsdir
.
conda activate env2
conda install kallisto=0.50
qsdir=~/course/data/day3/results/quicksand
refdir=~/course/data/day3/db/refseq_mtdna_genomes_byfam
ls $qsdir
ls $refdir | head
find $qsdir | head
Refseq mtDNA genomes are in $refdir, organized by Family.
Quicksand output is also organized by Family:
ls $qsdir/out
less -S $qsdir/final_report.tsv
awk 'NR == 1 || ($22 == "++" && $24 > 500)' $qsdir/final_report.tsv | less
ancient_fams=$(awk '$22 == "++" && $24 > 500' $qsdir/final_report.tsv | cut -f11)
echo $ancient_fams
mkdir -p ~/course/wdir/kal
cd ~/course/wdir/kal
for fam in $ancient_fams ; do
b=$qsdir/out/$fam/1-extracted/PRI-TJPGK-CATN-26-28.fq_extractedReads-$fam.bam
samtools fastq $b | gzip -c > $fam.reads.fq.gz
done
for fam in $ancient_fams ; do
ls $refdir/$fam.fasta.gz
zcat $refdir/$fam.fasta.gz | grep -c '>'
done
for fam in $ancient_fams ; do
ls $refdir/$fam.fasta.gz
kallisto index -i $fam.k21.idx -k21 $refdir/$fam.fasta.gz
done
for fam in $ancient_fams ; do
kallisto quant -o $fam.kal --single -l50 -s10 -i $fam.k21.idx $fam.reads.fq.gz
done
sort -k5,5nr Hominidae.kal/abundance.tsv
for fam in $ancient_fams ; do
echo
echo $fam
sort -k5,5nr $fam.kal/abundance.tsv | awk '{print $4, $1}' | head
done
kallisto index -i hominin_mtdna.noest.fa.k21.idx -k21 $refdir/../hominin_mtdna.noest.fa
kallisto quant -o hominin_mtdna.noest.kal --single -l50 -s10 -i hominin_mtdna.noest.fa.k21.idx Hominidae.reads.fq.gz
Now that we have produced/found out likely ancient dataset for taxa showing damage with mean read length > 35 and also verified a few animal taxa in the profiles we could go back and use the Rscripts for the read length plotting we used yesterday and replace the overall super kingdoms with taxa that would be interesting and or necessary to have further verifications for.
OBS needs reporting of the finding above and extraction of the reads for depth of coverage and phylo placement.
conda activate env1
Nuclear and mito genomes can be downloaded from any source here we use NCBI.
Now lets check the accession number/ID in the header of the fasta
head -1 MF770243.1.fa
And then check that the accession to taxid matches in the ncbi taxonomy
zgrep MF770243 ../../../data/shared/taxonomy/acc2taxid.map.gz
Now the third column is the taxid, let's double check on ncbi what species this taxid belongs to NCBI taxonomy id. Is this correct? If so, then let's map the reads up against it, but first we build a new small database to map against it.
index the genome
bowtie2-build MF770243.1.fa MF770243.1.fa
map reads againt it
for file in *vs.fq.gz
do
bowtie2 -U ../PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz -x NC_001941.1.fa --no-unal --threads 5 | samtools view -bS - > PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.bam
done &> ovis_map.log
Lets check depth of each position
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam
we can now cut out the coverage per position
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3 | datamash mean 1
if you forget to set the -a option the depth it will be depth per covered base, which is the correct way to report this.
samtools depth -a PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3 | datamash mean 1
In this tutorial we will investigate how the ancient bear DNA we retrieved from our environmental samples relates to modern black bear populations. For this, we use a pre-generated dataset of SNP genotype data for a modern bear reference panel with 79 individuals from five black bear populations, as well as two polar bear samples. Our three ancient environmental bear samples are represented by pseudo-haploid genotypes, which we generated by selecting a random high-quality allele for SNP positions in the reference panel which were covered by mapped sequencing reads in the respective sample.
First, we activate our environment and install some R
packages using mamba
conda activate env3
mamba install r-tidyverse r-ggrepel
now let's set up a working folder and copy the data
mkdir popgen
cd popgen
cp ~/course/data/day3/popgen/* .
The dataset is provided in PLINK
binary format (https://www.cog-genomics.org/plink/), a program widely used for SNP genotype data managment and analysis. The dataset is composed of three files
modern_polar_mexican.bed
modern_polar_mexican.bim
modern_polar_mexican.fam
You can learn more about the different file formats here https://www.cog-genomics.org/plink/1.9/formats
As an example of how we can use PLINK
, the following command calculates some basic missing data summaries:
plink --bfile modern_polar_mexican --missing --out modern_polar_mexican
One of the resulting output files
modern_polar_mexican.imiss
contains per-individual missing genotype rate (6th column F_MISS
).
Examine the file and find the entries for the three Mexican samples. How much missing data do they show? Would you expect to be able to obtain meaningful results with this amount of data?
We will now try to make sense of our data by carrying out two types of analyses. In the first part, we will use principal component analysis (PCA) for a first exploration of the structure in our dataset. Following this, we will use the so-called 'f-statistic' framework to perform more in-depth statistical analyses and test different hypotheses regarding the relationship of the ancient bears with the different modern populations.
We will use the smartpca
program from the EIGENSOFT
package (https://github.com/DReichLab/EIG), a widely used tool for PCA on genotype data. It has a number of features useful for the analysis of ancient DNA data, in particular the option to "project" individuals with poor quality data / high missingness onto principal components inferred from a set of high quality reference panel.
In order to carry out the smartpca
analysis, we need to prepare a file setting the parameters for the analysis. Below is an example parameter file modern_all.smartpca.par
genotypename: modern_polar_mexican.bed
snpname: modern_polar_mexican.bim
indivname: modern_polar_mexican.fam
evecoutname: modern_all.evec
evaloutname: modern_all.eval
familynames: NO
numoutevec: 20
numthreads: 2
numoutlieriter: 0
poplistname: modern_all.pops.txt
lsqproject: YES
pordercheck: NO
autoshrink: YES
The first five lines specify the input and output data, respectively. Other important parameters included are
numoutevec
- the number of princiapl components to be returned
poplistname
- name of a file containing the population IDs for samples to be used to infer the principal components. When using PLINK
format, population IDs for individuals have to be specified in column 6 of the .fam
file. All individuals in the dataset with population ID not included in the file will be projected onto the inferred components
lsqproject
- PCA projection algorithm appropriate for samples with high amount of missing data.
To run pca, we use the command
smartpca -p modern_all.smartpca.par | tee modern_all.smartpca.log
which will run smartpca
and print its log messages both to stdout
and to a file modern_all.smartpca.log
. Once the run is complete, the PCA coordinates for each sample (i.e. the eigenvectors) can be found in the output file modern_all.evec
.
To visualize the PCA results, we can use the provided R
script plot_pca.R
, which outputs simple bi-plots of all PCs in the output file:
Rscript plot_pca.R modern_all.evec label_inds.txt modern_all.pdf
The script takes three command line arguments:
- the filename for the
smartpca
eigenvector results - a file with a list of IDs for samples to be highlighted in the plot
- the filename for the output file in
pdf
format
The analysis with the parameter file modern_all.smartpca.par
outlined here performs a PCA projecting the ancient bear samples onto the full set of modern reference samples, including both black bears and polar bears. Explore the output and answer some of the following questions:
- Which populations are separated on the first few PCs?
- Where do the ancient bear samples fall, and how can we interpret their position?
- Is there any difference in PCA positions between the three bear samples, and if so, what could be the interpretation?
Once you have explored these results and familiarized yourself with the analysis, you can run PCA on other subsets of the data. A parameter file and corresponding poplistname
file for a subset excluding the two polar bears is provided as modern_blackbear.smartpca.par
. Explore the output of that and try to answer some of the same quesions as above. If you wish to explore other subsets, you can create your own poplistname
file with your populations of interest, and create a corresponding smartpca
parameter file to run the analysis
After our exploratory analysis using PCA, we would like to dive a bit deeper into the population genetics of our dataset. For this we will use the f-statistic framework, implemented in the R
package admixtools
. We activate the R
environment and install some needed packages
conda activate r
mamba install bioconductor-ggtree r-ape
An R
script with an example analysis is provided in the file f_statistics.R
. You can open this file in VS Code Studio, and use an interactive R
terminal to work through the commands