Skip to content
Mikkel Winther Pedersen edited this page Sep 26, 2024 · 2 revisions

Ancient pathogens

Introduction

​ In this tutorial we will analyse ancient plague DNA. ​

Set up

​ We will map our putative ancient plague sequencing reads against the reference genomes of three different species of Yersinia ​

  • Yersinia pestis, the causative agent of plague (GCF000009065.1)
  • Yersinia pseudotuberculosis, the closest relative species of Y pestis and a rare, usually mild food-borne pathogen (GCF_000834295.1)
  • Yersinia intermedia, a more distantly related species commonly found in the environment (GCF_009730055.1) ​ first, some environment and program set up ​
conda activate mapping
mamba install bedtools
mkdir pathogens
cd pathogens
ln -s ~/course/data/day5/* .

​ We will use bowtie2 to map our sequencing reads to these reference genomes. To do so we first need to build an index for each reference ​

bowtie2-build GCF_000009065.1_ASM906v1_genomic.fna GCF_000009065.1_ASM906v1
bowtie2-build GCF_000834295.1_ASM83429v1_genomic.fna GCF_000834295.1_ASM83429v1
bowtie2-build GCF_009730055.1_ASM973005v1_genomic.fna GCF_009730055.1_ASM973005v1

Read mapping

​ We're now ready to map our reads. The following command will map reads using bowtie2 in end-to-end mode, and stream the output into samtools to convert into a sorted BAM file. ​

bowtie2 -x GCF_000009065.1_ASM906v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam

​ We can get a quick summary on the resulting coverage using ​

samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam

  • What is the average coverage we obtained on the main chromosome? ​ ​ Importantly, the coverage on this output will be overestimated, as we expect some proportion of the mapped reads to be PCR duplicates originating from the same DNA molecule. For downstream analyses, we first have to flag or remove those duplicate reads, e.g. using ​
samtools rmdup -s RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam  RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

  • What fraction of the mapped reads were duplicates? ​ We can use the resulting file to check the new coverage ​
samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

  • How much was the coverage reduced? ​ ​ We can also examine the coverage across the reference genome using a simple graphical view ​
samtools coverage -m RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

  • Does the coverage look even across all contigs? ​ ​ Let's map the reads to the other two reference genomes, combining converting, sorting and duplicate removal into a single command ​
bowtie2 -x GCF_000834295.1_ASM83429v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam
bowtie2 -x GCF_009730055.1_ASM973005v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam

Damage estimates

​ To authenticate our data we need to examine whether the mapped reads show fragment lengths and post-mortem damage patterns typical of ancient DNA. We use mapDamage2 to compute and visualise those patterns ​

mapDamage -i RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam -r GCF_000009065.1_ASM906v1_genomic.fna --no-stats -d RISE505.GCF_000009065.1_ASM906v1
mapDamage -i RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam -r GCF_000834295.1_ASM83429v1_genomic.fna --no-stats -d RISE505.GCF_000834295.1_ASM83429v1
mapDamage -i RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam -r GCF_009730055.1_ASM973005v1_genomic.fna --no-stats -d RISE505.GCF_009730055.1_ASM973005v1

  • Are the damage patterns consistent with ancient Yersinia DNA? ​ ​

Detailed coverage statistics

​ The BEDTools suite is a very useful toolkit for obtaining more detailed coverage statistics that can be used in downstream analyses. First, we will use the genomecov tool to compute histograms for the number of bases covered at a certain coverage for all contigs in the assembly, as well as genome-wide. We will apply a mapping quality filter (MQ >= 20) to restrict our analyses to reads mapped with high confidence ​

samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.genomecov.tsv

​ We can stratify these statistics further by genomic features (such as genes) provided in a gff file. Let's download the genomic annotations for the Yersinia pestis CO92 reference genome in gff format ​

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.gff.gz
gunzip GCF_000009065.1_ASM906v1_genomic.gff.gz

​ We compute coverage histograms for all features in the gff file using ​

samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools coverage -a GCF_000009065.1_ASM906v1_genomic.gff -b stdin -hist > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genes.coverage.tsv

​ Now we have all primary results in place for further analysis and visualisation using R. We activate the R environment ​

conda activate r

​ An R script with some follow up analyses is provided in the file pathogens.R. You can open this file in VS Code Studio, and use an interactive R terminal to work through the commands

Group work

The afternoon will be available for following up on missing parts of the group analysis and assignments. Updating your group repositories and making 2-4 slides for the presentation tomorrow. Please make 1-2 slides with your answer to the Professor, that explains to him what you found and your interpretation. Please also reflect on either one finding you found interesting and/or one that you would like to discuss with the rest of us. It can be results but also methodological questions/issues or like.