Day 5

Ancient pathogens

Introduction

In this tutorial we will analyse ancient plague DNA.

Set up

We will map our putative ancient plague sequencing reads against the reference genomes of three different species of Yersinia

Yersinia pestis, the causative agent of plague (GCF000009065.1)
Yersinia pseudotuberculosis, the closest relative species of Y pestis and a rare, usually mild food-borne pathogen (GCF_000834295.1)
Yersinia intermedia, a more distantly related species commonly found in the environment (GCF_009730055.1) first, some environment and program set up

conda activate mapping
mamba install bedtools
mkdir pathogens
cd pathogens
ln -s ~/course/data/day5/* .

We will use bowtie2 to map our sequencing reads to these reference genomes. To do so we first need to build an index for each reference

bowtie2-build GCF_000009065.1_ASM906v1_genomic.fna GCF_000009065.1_ASM906v1
bowtie2-build GCF_000834295.1_ASM83429v1_genomic.fna GCF_000834295.1_ASM83429v1
bowtie2-build GCF_009730055.1_ASM973005v1_genomic.fna GCF_009730055.1_ASM973005v1

Read mapping

We're now ready to map our reads. The following command will map reads using bowtie2 in end-to-end mode, and stream the output into samtools to convert into a sorted BAM file.

bowtie2 -x GCF_000009065.1_ASM906v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam

We can get a quick summary on the resulting coverage using

samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam

What is the average coverage we obtained on the main chromosome? Importantly, the coverage on this output will be overestimated, as we expect some proportion of the mapped reads to be PCR duplicates originating from the same DNA molecule. For downstream analyses, we first have to flag or remove those duplicate reads, e.g. using

samtools rmdup -s RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam  RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

What fraction of the mapped reads were duplicates? We can use the resulting file to check the new coverage

samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

How much was the coverage reduced? We can also examine the coverage across the reference genome using a simple graphical view

samtools coverage -m RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam

Does the coverage look even across all contigs? Let's map the reads to the other two reference genomes, combining converting, sorting and duplicate removal into a single command

bowtie2 -x GCF_000834295.1_ASM83429v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam
bowtie2 -x GCF_009730055.1_ASM973005v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam

Damage estimates

To authenticate our data we need to examine whether the mapped reads show fragment lengths and post-mortem damage patterns typical of ancient DNA. We use mapDamage2 to compute and visualise those patterns

mapDamage -i RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam -r GCF_000009065.1_ASM906v1_genomic.fna --no-stats -d RISE505.GCF_000009065.1_ASM906v1
mapDamage -i RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam -r GCF_000834295.1_ASM83429v1_genomic.fna --no-stats -d RISE505.GCF_000834295.1_ASM83429v1
mapDamage -i RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam -r GCF_009730055.1_ASM973005v1_genomic.fna --no-stats -d RISE505.GCF_009730055.1_ASM973005v1

Are the damage patterns consistent with ancient Yersinia DNA?

Detailed coverage statistics

The BEDTools suite is a very useful toolkit for obtaining more detailed coverage statistics that can be used in downstream analyses. First, we will use the genomecov tool to compute histograms for the number of bases covered at a certain coverage for all contigs in the assembly, as well as genome-wide. We will apply a mapping quality filter (MQ >= 20) to restrict our analyses to reads mapped with high confidence

samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.genomecov.tsv

We can stratify these statistics further by genomic features (such as genes) provided in a gff file. Let's download the genomic annotations for the Yersinia pestis CO92 reference genome in gff format

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.gff.gz
gunzip GCF_000009065.1_ASM906v1_genomic.gff.gz

We compute coverage histograms for all features in the gff file using

samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools coverage -a GCF_000009065.1_ASM906v1_genomic.gff -b stdin -hist > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genes.coverage.tsv

Now we have all primary results in place for further analysis and visualisation using R. We activate the R environment

conda activate r

An R script with some follow up analyses is provided in the file pathogens.R. You can open this file in VS Code Studio, and use an interactive R terminal to work through the commands

Group work

The afternoon will be available for following up on missing parts of the group analysis and assignments. Updating your group repositories and making 2-4 slides for the presentation tomorrow. Please make 1-2 slides with your answer to the Professor, that explains to him what you found and your interpretation. Please also reflect on either one finding you found interesting and/or one that you would like to discuss with the rest of us. It can be results but also methodological questions/issues or like.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly