-
Notifications
You must be signed in to change notification settings - Fork 1
Day 3
The output from metaDMG now counts some basic stats that you would want to extract for your project but also for potential future publications or like, such as the total number of reads classified to any taxonomic level, total number of reads classified within a superkingdom and like. This can be obtained from the files in the data/lca/ folder and using simple text parsing comments that you have already been using in the course first two days ('wc -l' or 'grep -c'). Please collect total number of reads classified per sample, total number of reads classified to plants per sample and total reads classified to animals per sample.
Please also ensure you have performed the last step in the tutorial from yesterday, e.g. converting the metaDMG output to a .tsv file. Today we will explore a few examples of how to visualize such data in R.
Examples/starter code could be:
library(gghighlight)
dat100_filt |>
ungroup() |>
inner_join(metadata) |>
filter(fit == "good") |>
ggplot() +
geom_jitter(aes(x=as.numeric(years_bp), y=A_b, size = n_reads), alpha =0.5) +
gghighlight(n_reads > 500) +
xlab("Years BP") +
ylab("DNA damage") +
labs(color = "Values for taxa with \n>500 reads", size = "Number of reads")
In the above one could play around with filtering for certain groups of organisms say "plants" for instance. This can be done by adding filter(grepl("Viridiplantae", taxa_path)). Try a different plot type of your own.
We can also plot the taxa identified, but best would be to convert the read counts to percentage or another normalized value.
dat100_filt |>
filter(grepl("Eukaryota", taxa_path), fit == "good") %>% # Filter rows where taxa_path contains "Viridiplantae"
inner_join(metadata) |>
group_by(label) %>% # Group by label
mutate(percentage = n_reads / sum(n_reads) * 100) %>% # Calculate percentage
ungroup() |>
filter(percentage > 0.1) |>
ggplot(aes(y = name, x = years_bp, fill = percentage)) +
geom_tile() +
labs(x = "Label", y = "Taxa") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) # Rotate x-axis labels
Reflect on what are you observing in the different plots, are there taxa that makes sense for an irish mid-late Holocene environment. If you are not familiar with the scientific names try searching GBIF, please note the g__ is an addition we have made to the name and needs to be removed prioir to search.
Collect plots, reflect on what you are observing. Start thinking of how you can present your groups results, thoughts etc on friday afternoon.
Euka - a tetrapodic and arthropodic taxa detection from modern and ancient environmental DNA using pangenomic reference graphs
To run euka we first need to download the graph database:
cd ~/course/wdir/day3
mkdir euka_dir
wget -nc -l1 --no-parent -P . ftp://ftp.healthtech.dtu.dk:/public/euka_files/*
This can take a few minutes, please be patient. Once everything is downloaded we can run euka. With --euka_dir
we define the path to the directory containing our graph database. -o
is your chosen output prefix and --outFrag
produces a list of all read names per taxon (we need this for later).
You can always check out all available command options for euka with the following command:
~/course/data/vgan/bin/vgan euka
cd ~/course/wdir/day3/
~/course/data/vgan/bin/vgan euka -fq1 path/to/your/input/fastq --euka_dir ~/course/wdir/day3/euka_dir/ -o [output prefix] --outFrag -t 5
If you would like to input more than one fastq file to be analysed together please use a file descriptor:
~/course/data/vgan/bin/vgan euka -fq1 <(zcat path/to/your/input/fastq1 path/to/your/input/fastq2) --euka_dir ~/course/wdir/day3/euka_dir/ -o [output prefix] --outFrag
euka outputs a few different files. The first look should be in the [output prefix]_detected.tsv file, which provides an overview of all detected taxa.
- [output prefix]_abundance.tsv file provides an overview of all taxa in the database and how many reads have mapped to it.
- [output prefix]_coverage.tsv file shows the number of reads mapped to every bin of a detected taxon
- [output prefix]_inSize.tsv file shows the fragment length of each read assigned to a detected taxa
- [output prefix]_FragNames.tsv file shows the read names of each read assigned to a detected taxa
- For each detected taxa you get a .prof file which contains the damage profil for the reads of that taxon - you can have a look if its ancient. All these results can be plotted using provided plotting scripts. Please be aware that these script need specific R packages installed, to be able to plot you need to create a conda environment with the provided euka.yml file:
conda env create -f ~/course/data/vgan/bin/euka.yml
conda activate euka
Now, we can plot:
~/course/data/vgan/share/vgan/plottingScripts/visualize_detected_taxa.sh [output prefix]
soibean takes a fastq as input that is filtered to only contain reads that belong to one taxon. After detecting taxa with euka you can extract all reads from one detected taxon.
For this we need the tool seqtk
- it should be available in provided conda env2 or mapping. Otherwise, it would need to be installed using conda.
less -S [output prefix]_FragNames.tsv | sed '[row number]!d' | sed 's/\t/\n/g' | seqtk subseq [path/to/your/input/fastq] - | gzip > [output].fq.gz
The extracted fastq can be used to identify the contributing species with soibean. Again we first need to download the database:
cd ~/course/wdir/day3/
mkdir soibean_dir
cd soibean_dir
wget -nc --no-parent -P . ftp://ftp.healthtech.dtu.dk:/public/soibean_files/*
cp -r ~/course/data/vgan/share/vgan/soibean_dir/tree_dir/ tree_dir
soibean does not work on the entire soibean_db.og graph, but only on a user chosen subgraph. We need to extract a subgraph for a detected taxon. For a list of detected taxa and the correct name to use, have a look in the [output prefix]_detected.tsv
file from euka.
To extract a subgraph we need to go through the following steps:
cd ~/course/wdir/day3/soibean_dir/
wget -O make_graph_files.sh https://raw.githubusercontent.com/grenaud/vgan/refs/heads/main/share/vgan/soibean_dir/make_graph_files.sh
chmod u+x make_graph_files.sh
./make_graph_files.sh [taxon name]
Once the script has run you can check your directory with ls
and you should see multiple graph files that start with your taxon name. Now, we can run soibean.
Again the --soibean_dir
option defines the path to our graph file, or more specifically the path to your newly created subgraph files for your choice of taxon.
--tree_dir
specifies the directory where all the phylogenetic guide trees for each taxon are.
--dbprefix
helps soibean identify the subgraph and all additional files for your chosen taxon.
You can always check out all available command options for soibean with the following command:
~/course/data/vgan/bin/vgan soibean
To run soibean we use the following command:
cd ~/course/wdir/day3/
~/course/data/vgan/bin/vgan soibean -fq1 path/to/your/taxon/input/fastq --soibean_dir ~/course/wdir/day3/soibean_dir/ --tree_dir ~/course/wdir/day3/soibean_dir/tree_dir/ --dbprefix [taxon name] -o [output prefix] -t 10
Please know the MCMC used in soibean does a lot of calculations and is therefore quite slow especially with larger input fastqs. To save some time you can limit the number of iterations the MCMC runs, but that will cause high higher level of uncertainty in the prediction and likely cause the MCMC statistics to less optimal.
~/course/data/vgan/bin/vgan soibean -fq1 path/to/your/taxon/input/fastq --soibean_dir ~/course/wdir/day3/soibean_dir/ --tree_dir ~/course/wdir/day3/soibean_dir/tree_dir/ --dbprefix [taxon name] -o [output prefix] -t 10 --iter 110 --burnin 10 --chain 1 -k 2
soibean outputs a seperate file per chain per k (estimated number of source) the endings of each file are a combination of k and the chain number. For example the beanOut_Results13.mcmc file is for one source (k = 1) and chain number 3 (c = 3).
- [output prefix]Resultskc.mcmc contains all accepted moves of the MCMC
- [output prefix]Tracekc.detail.mcmc contains all accepted and rejected moves of the MCMC
- [output prefix]Diagnosticsk0.txt contains MCMC diagnostics about the convergence
- [output prefix]ProportionEstimatek.txt contains diagnostics about the convergence of the proportion parameter
- [output prefix]BranchEstimatek.txt contains diagnostics about the convergence of the branch position parameter
At the end of the day you can set soibean to run with increased iterations and burnin, try --iter 10000 --burnin 1500. But note the more iterations and burnin cycles you add the longer the processing and computational load.
You can plot soibeans results with two different scripts. Again please create and activate the conda environment first:
conda env create -f ~/course/data/vgan/bin/soibean.yml
conda activate soibean
Rscript ~/course/data/vgan/share/vgan/plottingScripts/soibeanPlotk.R [output prefix]
Rscript ~/course/data/vgan/share/vgan/plottingScripts/soibeanPlots.R ~/course/data/soibean_dir/tree_dir/[taxon name].new.dnd [output prefix]Resultskc.mcmc
For more details and command options please find a complete README.md for both euka and soibean here: https://github.com/grenaud/vgan ...
The goal of this part of the tutorial is to place a consensus mitogenome of the elephantid reads into a phylogeny. To do this, you will extract the elephantid reads, re-map them to a reference mitogenome, call a consensus mitogenome, and place the mitogenome into a phylogenetic context.
conda activate env2
mkdir ~/course/wdir/phylogenetic_placement
cd ~/course/wdir/phylogenetic_placement
Extract a list of elephantid read IDs from the lca file in the 'mapping' working directory. Update the name of the '.filtered.metaDMG.bam.lca.gz' file as appropriate.
zgrep 'Elephantidae' ~/course/wdir/mapping/PRI-TJPGK-CATN-26-28.filtered.metaDMG.bam.lca.gz | cut -f1 > PRI-TJPGK-CATN-26-28.elephantidae.readID.txt
Check how many read IDs were extracted
wc -l PRI-TJPGK-CATN-26-28.elephantidae.readID.txt
Extract the elephantid-assigned reads in fastq format
seqtk subseq ~/course/data/day1/fastq/PRI-TJPGK-CATN-26-28.fq.gz PRI-TJPGK-CATN-26-28.elephantidae.readID.txt | gzip > PRI-TJPGK-CATN-26-28.elephantidae.fq.gz
Check how many reads were extracted. The count should be same as read ID count calculated above.
zcat PRI-TJPGK-CATN-26-28.elephantidae.fq.gz | awk '{s++}END{print s/4}'
Retrieve the woolly mammoth reference mitochondrial genome from NCBI
mkdir ~/course/wdir/reference_mitogenomes
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_007596.2&rettype=fasta" -O ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta
Check that the woolly mammoth reference mitochondrial genome downloaded ok
head ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta
Build a bowtie2 reference of the woolly mammoth mitochondrial genome. Note that bowtie2 will output the index files to the current working directory, so we need to change directory, build the reference, and then change back
cd ~/course/wdir/reference_mitogenomes
bowtie2-build Mammuthus_primigenius_mtDNA.fasta Mammuthus_primigenius_mtDNA.fasta
cd ~/course/wdir/phylogenetic_placement
Check that the build worked. There should be six .bt2 files.
ls ~/course/wdir/reference_mitogenomes
Map the elephantid-assigned reads to the woolly mammoth mitochondrial genome. Here, we are using a map-quality cut-off of 20.
bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 20 -bS - > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam
The standard output gives the bowtie2 alignment statistics. How many reads were not aligned with bowtie2?
Check the count of reads in the final bam file using samtools
samtools flagstat PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam | head -1
Why is there a discrepancy between the aligned read count from samtools and bowtie2?
First, we need to sort the bam file using samtools
samtools sort PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.bam > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam
For fun, we can calculate the mean depth of coverage for the bam file using samtools
samtools depth -a PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam | cut -f3 | datamash mean 1
Calling the consensus using angsd. As angsd may give a shared library load error (you can check), we can reinstall from conda. If doing this, hit 'y' when prompted.
conda install bioconda::angsd
angsd -dofasta 2 -docounts 1 -minmapq 20 -minq 25 -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam -out PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort
gzip -d PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa.gz
Now edit the fasta header and rename it sample26_mamPri_mq20. We will use the nano text editor. Install and hit 'y' when prompted.
conda install nano
nano PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa
Make the edit by replacing '>NC_007596.2' with '>sample26_mamPri_mq20' on the first line. Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
Download the 45 elephantid mitogenomes
wget "https://www.dropbox.com/scl/fi/5kit6qpxuvschmu3r4zy0/45_elephantidae.fa?rlkey=chidwctyt82yajn4cuh99qx20&dl=0" -O Elephantidae_45_mtDNA.fa
Merge the sample26 and 45 elephantid mitogenome fastas into a single file
cat Elephantidae_45_mtDNA.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa > 45_elephantidae_sample26.fa
mafft --thread 5 45_elephantidae_sample26.fa > 45_elephantidae_sample26.mafft
First, we need to install raxml-ng. Hit 'y' when prompted during installation. The following command uses five threads for analytical speed, 200 bootstrap replicates for branch support, and sets the two American mastodon (Mammut americanum) sequences as outgroup. The --redo flag allows for file overwriting if the analysis needs to be re-run.
conda install raxml-ng
raxml-ng --all --msa 45_elephantidae_sample26.mafft --model GTR+G --prefix EleMT --threads 5 --seed 2 --bs-trees 200 --outgroup "ENA|EF632344|EF632344.1.Mammut.americanum.mitochondrion.complete.genome.","ENA|KY364233|KY364233.1.Mammut.americanum.mitochondrion.complete.genome." --redo
The key output file is 'EleMT.raxml.support'. If we add the '.tre' extension, then this file can be opened by a variety of tree editors
cp EleMT.raxml.support EleMT.raxml.support.tre
The EleMT.raxml.support.tre tree file can be downloaded and visualized on http://etetoolkit.org/treeview/ or with FigTree (https://github.com/rambaut/figtree/releases)
Here we use a higher map quality cut-off (mq30) and align to a second reference mitogenome (African elephant) to test the consistency of our results.
wget "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000934.1&rettype=fasta" -O ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta
head ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta
cd ~/course/wdir/reference_mitogenomes
bowtie2-build Loxodonta_africana_mtDNA.fasta Loxodonta_africana_mtDNA.fasta
cd ~/course/wdir/phylogenetic_placement
ls ~/course/wdir/reference_mitogenomes
bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 30 -bS - > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam
bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 20 -bS - > PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam
bowtie2 --threads 5 -x ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -U PRI-TJPGK-CATN-26-28.elephantidae.fq.gz --no-unal | samtools view -q 30 -bS - > PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam
samtools flagstat PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam | head -1
samtools flagstat PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam | head -1
samtools flagstat PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam | head -1
samtools sort PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.bam > PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam
samtools sort PRI-TJPGK-CATN-26-28.Loxodonta_africana.bam > PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam
samtools sort PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.bam > PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam
samtools depth -a PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam | cut -f3 | datamash mean 1
samtools depth -a PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam | cut -f3 | datamash mean 1
samtools depth -a PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam | cut -f3 | datamash mean 1
angsd -dofasta 2 -docounts 1 -minmapq 30 -minq 25 -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam -out PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort
gzip -d PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa.gz
angsd -dofasta 2 -docounts 1 -minmapq 20 -minq 25 -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam -out PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort
gzip -d PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa.gz
angsd -dofasta 2 -docounts 1 -minmapq 30 -minq 25 -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam -out PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort
gzip -d PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa.gz
nano PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa
Make the edit by replacing '>NC_007596.2' with '>sample26_mamPri_mq30' on the first line. Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
nano PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa
Make the edit by replacing '>NC_007596.2' with '>sample26_loxAfr_mq20' on the first line. Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
nano PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa
Make the edit by replacing '>NC_007596.2' with '>sample26_loxAfr_mq30' on the first line. Exit with 'Ctrl+X' followed by 'y' followed by 'enter'.
cat Elephantidae_45_mtDNA.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.fa PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.fa PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.fa PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.fa > 45_elephantidae_sample26_v2.fa
mafft --thread 5 45_elephantidae_sample26_v2.fa > 45_elephantidae_sample26_v2.mafft
raxml-ng --all --msa 45_elephantidae_sample26_v2.mafft --model GTR+G --prefix EleMT_v2 --threads 5 --seed 2 --bs-trees 200 --outgroup "ENA|EF632344|EF632344.1.Mammut.americanum.mitochondrion.complete.genome.","ENA|KY364233|KY364233.1.Mammut.americanum.mitochondrion.complete.genome." --redo
cp EleMT_v2.raxml.support EleMT_v2.raxml.support.tre
The EleMT_v2.raxml.support.tre tree file can be downloaded and visualized on http://etetoolkit.org/treeview/ or with FigTree (https://github.com/rambaut/figtree/releases)
Are there any differences in the mitogenome branch lengths and their phylogenetic positions? Why might there be these differences?
We can run mapDamage as an alternative way of visualizing DNA damage.
conda activate env3
mapDamage -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.sort.bam -r ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -l 150 --merge-reference-sequences --no-stats -d ./mapDamage_PRI-TJPGK-CATN-26-28_mamPri_mq20 -y 0.5 -m 25 -t Elephantidae_PRI-TJPGK-CATN-26-28_mamPri_mq20
mapDamage -i PRI-TJPGK-CATN-26-28.Mammuthus_primigenius.mq30.sort.bam -r ~/course/wdir/reference_mitogenomes/Mammuthus_primigenius_mtDNA.fasta -l 150 --merge-reference-sequences --no-stats -d ./mapDamage_PRI-TJPGK-CATN-26-28_mamPri_mq30 -y 0.5 -m 25 -t Elephantidae_PRI-TJPGK-CATN-26-28_mamPri_mq30
mapDamage -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.sort.bam -r ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -l 150 --merge-reference-sequences --no-stats -d ./mapDamage_PRI-TJPGK-CATN-26-28_loxAfr_mq20 -y 0.5 -m 25 -t Elephantidae_PRI-TJPGK-CATN-26-28_loxAfr_mq20
mapDamage -i PRI-TJPGK-CATN-26-28.Loxodonta_africana.mq30.sort.bam -r ~/course/wdir/reference_mitogenomes/Loxodonta_africana_mtDNA.fasta -l 150 --merge-reference-sequences --no-stats -d ./mapDamage_PRI-TJPGK-CATN-26-28_loxAfr_mq30 -y 0.5 -m 25 -t Elephantidae_PRI-TJPGK-CATN-26-28_loxAfr_mq30
Do you see any differences between the four plots? Why might there be these differences?
Unfortunately we couldn't get quicksand to run on the virtual machines, so I ran quicksand on the test samples, and then copied the output to $qsdir
.
conda activate env2
conda install kallisto=0.50
qsdir=~/course/data/day3/results/quicksand
refdir=~/course/data/day3/db/refseq_mtdna_genomes_byfam
ls $qsdir
ls $refdir | head
find $qsdir | head
Refseq mtDNA genomes are in $refdir, organized by Family.
Quicksand output is also organized by Family:
ls $qsdir/out
less -S $qsdir/final_report.tsv
awk 'NR == 1 || ($22 == "++" && $24 > 500)' $qsdir/final_report.tsv | less
ancient_fams=$(awk '$22 == "++" && $24 > 500' $qsdir/final_report.tsv | cut -f11)
echo $ancient_fams
mkdir -p ~/course/wdir/kal
cd ~/course/wdir/kal
for fam in $ancient_fams ; do
b=$qsdir/out/$fam/1-extracted/PRI-TJPGK-CATN-26-28.fq_extractedReads-$fam.bam
samtools fastq $b | gzip -c > $fam.reads.fq.gz
done
for fam in $ancient_fams ; do
ls $refdir/$fam.fasta.gz
zcat $refdir/$fam.fasta.gz | grep -c '>'
done
for fam in $ancient_fams ; do
ls $refdir/$fam.fasta.gz
kallisto index -i $fam.k21.idx -k21 $refdir/$fam.fasta.gz
done
for fam in $ancient_fams ; do
kallisto quant -o $fam.kal --single -l50 -s10 -i $fam.k21.idx $fam.reads.fq.gz
done
sort -k5,5nr Hominidae.kal/abundance.tsv
for fam in $ancient_fams ; do
echo
echo $fam
sort -k5,5nr $fam.kal/abundance.tsv | awk '{print $4, $1}' | head
done
kallisto index -i hominin_mtdna.noest.fa.k21.idx -k21 $refdir/../hominin_mtdna.noest.fa
kallisto quant -o hominin_mtdna.noest.kal --single -l50 -s10 -i hominin_mtdna.noest.fa.k21.idx Hominidae.reads.fq.gz
Now that we have produced/found out likely ancient dataset for taxa showing damage with mean read length > 35 and also verified a few animal taxa in the profiles we could go back and use the Rscripts for the read length plotting we used yesterday and replace the overall super kingdoms with taxa that would be interesting and or necessary to have further verifications for.
OBS needs reporting of the finding above and extraction of the reads for depth of coverage and phylo placement.
conda activate env1
Nuclear and mito genomes can be downloaded from any source here we use NCBI.
Now lets check the accession number/ID in the header of the fasta
head -1 MF770243.1.fa
And then check that the accession to taxid matches in the ncbi taxonomy
zgrep MF770243 ../../../data/shared/taxonomy/acc2taxid.map.gz
Now the third column is the taxid, let's double check on ncbi what species this taxid belongs to NCBI taxonomy id. Is this correct? If so, then let's map the reads up against it, but first we build a new small database to map against it.
index the genome
bowtie2-build MF770243.1.fa MF770243.1.fa
map reads againt it
for file in *vs.fq.gz
do
bowtie2 -U ../PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz -x NC_001941.1.fa --no-unal --threads 5 | samtools view -bS - > PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.bam
done &> ovis_map.log
Lets check depth of each position
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam
we can now cut out the coverage per position
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3
samtools depth PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3 | datamash mean 1
if you forget to set the -a option the depth it will be depth per covered base, which is the correct way to report this.
samtools depth -a PRI-TJPGK-CATN-224-226.fq.gz.vs.fq.gz.ovis.sort.bam | cut -f3 | datamash mean 1