-
Notifications
You must be signed in to change notification settings - Fork 4
Day 3
The output from metaDMG now counts some basic stats that you would want to extract for your project but also for potential future publications or like, such as the total number of reads classified to any taxonomic level, total number of reads classified within a superkingdom and like. This can be obtained from the files in the data/lca/ folder and using simple text parsing comments that you have already been using in the course first two days ('wc -l' or 'grep -c'). Please collect total number of reads classified per sample, total number of reads classified to plants per sample and total reads classified to animals per sample.
Please also ensure you have performed the last step in the tutorial from yesterday, e.g. converting the metaDMG output to a .csv file. Today we will explore a few examples of how to visualize such data in R.
First activate r environment and R,
conda activate r
R
Next, load the libraries needed and the data, if you are missing some libraries you can install these by install.packages("name.of.library")
library(tidyverse)
library(reshape2)
library(vegan)
library(rioja)
library(ggplot2)
library(dplyr)
library(gghighlight)
First, set working directory
setwd("~/course/wdir/mapping/plots/")
Import data
df <- read_csv("~/course/wdir/mapping/metaDMGresults.csv")
Count number of columns in dataframe, just to familiarize your self with it.
ncol(df)
Print all sample names
unique(df$sample)
Import metadata
metaDATA <- read.delim("~/course/data/shared/metadata/metadata.tsv")
Replace header "sample_name" with "sample"
colnames(metaDATA)[colnames(metaDATA) == "sample_name"] <- "sample"
colnames(metaDATA)[colnames(metaDATA) == "years_bp"] <- "YearsBP"
Merge the metadata and dataframe by sample
dt <- merge(df, metaDATA, by = "sample")
Now check that new columns have been added to the dt dataframe
ncol(df) < ncol(dt)
Print all different ages
unique(dt$YearsBP)
Print all column names of the table
colnames(dt)
Maximum amount of damage (filtered for)
DamMin2 = 0.00
MAP_Significance
MapSig2 = 0
Minimum reads for parsing taxa
MinRead2 = 100
Minimum mean read length
MinLength = 35
Subsetting the table by plant (Viridiplantae) genus using grepl and filter, parameters you need to set and possible add more?? Please think about how your group think it should be done. But first would probably be best to explore the data with less stringent filters, and plot these, then later on the basis of the data make decisions on how you want it filtered.
dt2 <- dt %>% filter(MAP_damage > DamMin2, N_reads >= MinRead2, mean_L > MinLength, MAP_significance > MapSig2, grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", sample))
Now plot your data, you can add other variables in the gghighlight to illustrate different cut-offs, or other types of values etc.
pdf(file = "aeCourse.DNAdamageModelJitterPlot.pdf", width = 8, height = 4)
ggplot() +
geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = N_reads), alpha =0.5) +
gghighlight(N_reads > 500) +
xlab("Years BP") +
ylab("DNA damage") +
labs(color = "Values for taxa with \n>500 reads", size = "Number of reads")
dev.off()
Plot plant taxa, highlight taxa with more than 500 reads and add the min, max and median. save as pdf
pdf(file = "aeCourse.DNAdamageLRJitterPlot.pdf", width = 8, height = 4)
ggplot() +
geom_jitter(data = dt2, aes(x=as.numeric(YearsBP), y=MAP_damage, size = MAP_significance), alpha =0.5) +
gghighlight(N_reads > 500) +
xlab("Years BP")+
ylab("DNA damage") +
labs(color = "Values for taxa with \n>500 reads", size = "Significance \nfor Taxa with >500 reads")
dev.off()
Create filtered table for DNA damage model
filtered_data <- dt2 %>% filter(N_reads >= 500)
Now you should make decisions on how the data should be filtered. You can use the
MapSig3 = 3
MinRead3 = 100
MinLength3 = 35
Subsetting the table using grepl and filter, parameters you need to set and possible add more?
filtered_data_metazoan <- dt %>% filter(N_reads >= 10, mean_L > MinLength3, MAP_significance > MapSig3, grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unfiltered_data_metazoan <- dt %>% filter(N_reads >= 2, mean_L > MinLength3, MAP_significance > 1, grepl("Metazoa",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
filtered_data_viridiplantae <- filtered_data %>% filter(N_reads >= 100, mean_L > MinLength3, MAP_significance > LR3, grepl("Viridiplantae",tax_path), grepl("\\bgenus\\b", tax_rank), grepl("", YearsBP))
unique(filtered_data_viridiplantae$YearsBP)
You can also make a smaller table with only values of your choice example below.
select(filtered_data_viridiplantae, tax_name, MAP_damage, MAP_significance, N_reads, YearsBP)
Count number of unique plant taxa
unique(filtered_data_viridiplantae$tax_name)
Make wide table for downstream plot and data wrangling of the plants.
data_wide_plants <- dcast(filtered_data_viridiplantae, tax_name ~ YearsBP, value.var="N_reads", fun.aggregate = sum)
n <- ncol(data_wide_plants)
b2 <- data_wide_plants[,2:n]
rownames(b2) <- data_wide_plants$tax_name
b2[is.na(b2)] <- 0 #set all NAs as zeros
Prints sum of samples and taxa and depths names
colSums(b2)
rowSums(b2)
colnames(b2)
Test, if this one fails it might have text in the number of reads coloumn
b2[is.na(b2)]=0
Create percentage table, by taking the number of columns and divides these with sum of all reads in the column, remove the -1 and you will plot the Nreads column on later plots
i=ncol(b2)
b3=as.matrix(b2[,seq(1,i)])
b4 <- prop.table(data.matrix(b3), margin=2)*100 # makes proportion table, needs 2 margins e.g. header and 1st row names
colSums(prop.table(b4, margin=2)*100) # prints sum of column, should give 100 for each
Next we will transpose the table, for plotting it as a strat.plot
b5 <- t(b4)
z <- as.numeric(rownames(b5)) # depth/depth
and plot it on a stratigraphic plot (typical pollen type plot)
pdf(file = "aeCourse.Stratplot_Plants_area.pdf", width = 15, height = 5)
strat.plot(b5, y.rev=TRUE, plot.line=TRUE, plot.poly=TRUE, plot.bar=FALSE, lwd.bar=10, sep.bar=TRUE, scale.percent=TRUE, xSpace=0.01, x.pc.lab=TRUE, x.pc.omit0=TRUE, srt.xlabel=45, las=2, exag=TRUE, exag.mult=5, ylabel = "years BP", yvar = z)
dev.off()
Now we will convert the wide table into a long table format to plot it with ggplot
y <- ncol(b5)
b6 <- melt(b5[,1:y])
sapply(b6, class)
colnames(b6) <- c("YearsBP","Taxa", "percentage")
p1 <- ggplot(b6, aes(y=Taxa, x=YearsBP, fill=percentage)) + geom_tile(colour="lightgrey") +
theme_minimal() + scale_fill_gradient(low="white", high="darkgreen") + scale_y_discrete(limits=rev)
p1 + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust =1)) + ggtitle("percentage of taxa plotted as heatmap") +
xlab("YearsBP") + ylab("YearsBP") + labs(fill = "percentage %")
These were just a few examples, it is possible to replace values and plot data in a wide variety of ways. Which you need and will use depends on your data, your study and the story you want to tell.
Euka - a tetrapodic and arthropodic taxa detection from modern and ancient environmental DNA using pangenomic reference graphs
If you are looking for other available options just run ~/course/data/vgan/bin/vgan euka for the manual.
Within the vgan folder you you can find a folder called bin where the vgan executable can be found. So, to run euka you can just use the following command:
~/course/data/vgan/bin/vgan euka -fq1 <(zcat ~/course/wdir/mapping/PRI-TJPGK-CATN-96-98.fq.gz.vs.fq.gz) -o PRI-TJPGK-CATN-96-98 -t 5
If you want to input multiple fastq files at once, please use a file descriptor otherwise euka will not recognize all the files. For example:
or
~/course/data/vgan/bin/vgan euka -fq1 <(zcat ~/course/wdir/mapping/*vs.fq.gz) -o all_samples -t 5
Now that we have produced/found out likely ancient dataset for taxa showing damage with mean read length > 35 and also verified a few animal taxa in the profiles we could go back and use the Rscripts for the read length plotting we used yesterday and replace the overall super kingdoms with taxa that would be interesting and or necessary to have further verifications for.
Print sample list based on name files
ls *lca.txt > sample.list
nano name_genus
> save it as > 'taxa.list'
while read -r line
do
arr=($line)
lib=${arr[0]}
echo $lib
cat taxa.list | parallel -j20 "grep {} $lib | cut -f1,2,3,4,5,6,7 -d: > $lib.{}.readID.txt"
done < sample.list
wc -l *.readID.txt| awk '$1 == 0' | awk '{print $2}' > rm.list
cat rm.list | parallel -j5 "rm {}"
wc -l *.readID.txt| paste > tot_genus_sequences.txt
for infile in *readID.txt
do
bname=$(basename $infile)
bname1=$(echo $bname | sed 's/.readID.txt*/.fq/')
bname2=$(echo $bname | cut -f1 -d. )
echo $bname1
echo $bname2
seqtk subseq path_to_files_fq/adap2_kmer2_*$bname2*.fq.pp.rmdup.fq $bname > $bname1 &
done
Download reference mitochondrial genomes from NCBI or directly on NCBI Nucleotide database
Send to > File > Fasta > Create File
The samples used here were based on paper Taylor et al. 2021:Taylor, William TT, et al. "Evidence for early dispersal of domestic sheep into Central Asia." Nature Human Behaviour 5.9 (2021): 1169-1179.)
See ReferenceTable1.xlsx
sed "s/:/_/g" in.fa > out.fa
sed "s/ /_/g" in.fa > out.fa
sed "s/;/_/g" in.fa > out.fa
sed "s/__/_/g" in.fa > out.fa
sed "s/(/./g" in.fa > out.fa
sed "s/)/./g" in.fa > out.fa
cat *.fasta > in_to_align.fa
mafft --thread 5 --leavegappyregion Ovis_aries_musimon2.fasta > Ovis_aries_musimon2_aligned.fa
in Geneious (if available): Tools > Generate Consensus Sequence > majority base calling
Or use Python BioAlign's consensus function (python) - biophyton needs to be installed: conda install -c conda-forge biopython
python /path_to_Script/get_consensus.txt Reference_aligned.fa Reference_output_Consensus.fa
for file in *.fa
do
bn=`basename $file .fa`
awk '/^>/ {gsub(/.fa(sta)?$/,"",FILENAME);printf(">%s\n",FILENAME);next;} {print}' $file > ${bn}_renamed.fa ; rm $file
done
cat *Consensus.fa > MSA_to_align.fa
mafft --thread 5 --leavegappyregion MSA_to_align.fa > MSA_out_aligned.fa
raxml-ng --msa MSA_out_aligned.fa --model GTR+G --prefix output_name --threads 2
If any warnings, run the '*reduced.phy' files to get a clean Tree.
for file in *.fa
do
bname=$(basename $file)
bowtie2-build -f $file DB_$bname
done
for file in $(pwd)/*.fq
do
bname=$(basename $file)
echo $bname
bname2=$(echo $bname | cut -f1-3 -d_)
echo $bname2
bname3=$(echo $bname | sed 's/.fq.pp.rmdup.fq*/_holi/')
basepath=$(pwd)/
basefolder=$basepath
echo $basepath
echo $bname3
mkdir $basepath$bname3
cd $basepath$bname3
done
for DB in /path_to_DB/name_DB
do
echo Mapping adap2_kmer2_$bname.pp.rmdup.fq against $DB
bowtie2 --threads 5 -x $DB -U $file --no-unal | samtools view -bS - > $bname2.$(basename $DB).bam
done
for DB in /path_to_DB/name_DB
do
echo Mapping adap2_kmer2_$bname.pp.rmdup.fq against $DB
bowtie2 --threads 5 -x $DB -U $file --no-unal | samtools view -bS - > $bname2.$(basename $DB).bam
done
for bam in *.bam
do
echo Sorting bam files
samtools sort -O BAM -o sort.$bam $bam
done
for sort in sort.*.bam
do
echo Bamcov $sort
/path_to_bamcov/bamcov -m -w0 $sort | paste > bamcov_hist_$sort.txt
/path_to_bamcov/bamcov $sort | paste > bamcov_table_$sort.txt
done
cd $basepath
done
Important to assess the coverage of the mapped reads to different mitogenomes before proceeding with the tree.
mkdir new_folder
for file in *holi/sort.*.bam
do
mv $file new_folder
done
for file in sort*.bam
do
angsd -doFasta 2 -doCounts 1 -i $file -out $file.consensus.fa
done
for file in *.consensus.fa.fa.gz
do
bname=$(basename $file)
echo $bname
gunzip $file
done
for file in *.consensus.fa.fa
do
bname=$(basename $file)
echo $bname
find $file -size 0 -delete
done
for file in *.fa.fa
do
bn=`basename $file .fa.fa`
awk '/^>/ {gsub(/.fa(sta)?$/,"",FILENAME);printf(">%s\n",FILENAME);next;} {print}' $file > ${bn}_renamed.fa ; rm $file
done
cat *renamed.fa MSA_not_aligned.fa > Allref_mycons.fa
mafft --thread 40 --leavegappyregion Allref_mycons.fa > Allref_mycons_aligned.fa
raxml-ng --msa output_name.raxml.reduced.phy --model GTR+G --prefix output_nameRED --threads 2
rerun using reduced.phy file if optimised alignment by raxml is suggested
We ran a tree in Beast based on the msa file for references and for references with sample files. Parameters used in BEAST (v.1.10.4) with 20,000,000 replicates, applying the HKY model and a Coalescent Constant Population as prior. BEAUTI > import alignment, set parameters, save file as Aln_mafft_taxa_references.xml
beast -threads n Aln_mafft_taxa_references.xml
Check log file TRACER > open .log file and check burnin, if not okay, increase replicates
TreeAnnotator → open .trees file and save it as a Maximum Credibility Tree with 10% Burnin when saving/naming the file you have to add the file extension .nexus
If using BEAST2 the tree will be saved with .tree extension that can be read by FIGTREE.
FIGTREE > Open tree.nexus > export tree as Newick > save with .nwk #Visualise node labels/branch labels as posterior values
Check .nwk file
Check with text editor if tree.nwk has any ‘ symbols, if so, remove them
to call those SNPs in a given dataset of ancient samples and find the best path and branch where these can be mapped in the tree.
snp-sites -v -c -o Aln_mafft_taxa_references.vcf Aln_mafft_taxa_references.fa
Rscript /path_to_Rscript/fix_vcf.R Aln_mafft_taxa_references.vcf Aln_mafft_taxa_references_output.vcf
awk '{ if ($1 == "1") $1="consensus";}1' Aln_mafft_taxa_references_output.vcf | sed 's/ /\t/g' > Aln_mafft_taxa_references_fixed_output.vcf
python /path_To_Script/get_consensus.txt Aln_mafft_taxa_references.fa Cons_Aln_mafft_taxa_references.fa
bwa index Cons_Aln_mafft_taxa_references.fa
mkdir map_to_cons
bwa aln -l 1024 -n 0.001 -t 10 /path_to_folder/Consensus_Mafft_All_references.fa /path_to_folder/library.fq
| bwa samse /path_to_folder/Consensus_Mafft_All_references.fa - /path_to_folder/library.fq
| samtools view -F 4 -q 25 -@ 10 -uS - | samtools sort -@ 10 -o library.sort.bam
samtools view -c file.bam
Create directory in pathPhynder_analysis folder
mkdir pathphynder_results
phynder -B -o /path_to_folder/branches.snp /path_to_folder/Mafft_All_reference.nwk /path_to_folder/Mafft_All_references_fixed_consensus.vcf
Prepare data - this will output a bed file for calling variants and tables for phylogenetic placement
pathPhynder -s prepare -i /path_to_folder/Mafft_All_reference.nwk -p taxa_pathphynder_tree -f /path_to_folder/branches.snp -r /path_to_folder/Consensus_Mafft_All_references.fa
pathPhynder -s all -t 5 -m transversions -i /path_to_folder/Mafft_All_references.nwk -p /path_to_folder/taxa_pathphynder_tree -l /path_to_folder/bamlist.txt -r /path_to_folder/Consensus_Mafft_All_references.fa
pathPhynder -s all -t 5 -i /path_to_folder/Mafft_All_references.nwk -p /path_to_folder/taxa_pathphynder_tree -l /path_to_folder/bamlist.txt -r /path_to_folder/Consensus_Mafft_All_references.fa
Visualise pdf files with the results.