-
Notifications
You must be signed in to change notification settings - Fork 1
Day 5
In this tutorial we will analyse ancient plague DNA.
We will map our putative ancient plague sequencing reads against the reference genomes of three different species of Yersinia
-
Yersinia pestis, the causative agent of plague (
GCF000009065.1
) -
Yersinia pseudotuberculosis, the closest relative species of Y pestis and a rare, usually mild food-borne pathogen (
GCF_000834295.1
) -
Yersinia intermedia, a more distantly related species commonly found in the environment (
GCF_009730055.1
) first, some environment and program set up
conda activate mapping
mamba install bedtools
mkdir pathogens
cd pathogens
ln -s ~/course/data/day5/* .
We will use bowtie2
to map our sequencing reads to these reference genomes. To do so we first need to build an index for each reference
bowtie2-build GCF_000009065.1_ASM906v1_genomic.fna GCF_000009065.1_ASM906v1
bowtie2-build GCF_000834295.1_ASM83429v1_genomic.fna GCF_000834295.1_ASM83429v1
bowtie2-build GCF_009730055.1_ASM973005v1_genomic.fna GCF_009730055.1_ASM973005v1
We're now ready to map our reads. The following command will map reads using bowtie2
in end-to-end
mode, and stream the output into samtools
to convert into a sorted BAM file.
bowtie2 -x GCF_000009065.1_ASM906v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam
We can get a quick summary on the resulting coverage using
samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam
- What is the average coverage we obtained on the main chromosome? Importantly, the coverage on this output will be overestimated, as we expect some proportion of the mapped reads to be PCR duplicates originating from the same DNA molecule. For downstream analyses, we first have to flag or remove those duplicate reads, e.g. using
samtools rmdup -s RISE505.GCF_000009065.1_ASM906v1.mapped.sort.bam RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam
- What fraction of the mapped reads were duplicates? We can use the resulting file to check the new coverage
samtools coverage RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam
- How much was the coverage reduced? We can also examine the coverage across the reference genome using a simple graphical view
samtools coverage -m RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam
- Does the coverage look even across all contigs? Let's map the reads to the other two reference genomes, combining converting, sorting and duplicate removal into a single command
bowtie2 -x GCF_000834295.1_ASM83429v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam
bowtie2 -x GCF_009730055.1_ASM973005v1 -p 4 --end-to-end -U RISE505.fq.gz | samtools view -bh | samtools sort - | samtools rmdup -s - RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam
To authenticate our data we need to examine whether the mapped reads show fragment lengths and post-mortem damage patterns typical of ancient DNA. We use mapDamage2
to compute and visualise those patterns
mapDamage -i RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam -r GCF_000009065.1_ASM906v1_genomic.fna --no-stats -d RISE505.GCF_000009065.1_ASM906v1
mapDamage -i RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam -r GCF_000834295.1_ASM83429v1_genomic.fna --no-stats -d RISE505.GCF_000834295.1_ASM83429v1
mapDamage -i RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam -r GCF_009730055.1_ASM973005v1_genomic.fna --no-stats -d RISE505.GCF_009730055.1_ASM973005v1
- Are the damage patterns consistent with ancient Yersinia DNA?
The BEDTools
suite is a very useful toolkit for obtaining more detailed coverage statistics that can be used in downstream analyses. First, we will use the genomecov
tool to compute histograms for the number of bases covered at a certain coverage for all contigs in the assembly, as well as genome-wide. We will apply a mapping quality filter (MQ >= 20) to restrict our analyses to reads mapped with high confidence
samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_000834295.1_ASM83429v1.mapped.sort.rmdup.genomecov.tsv
samtools view -q20 -bh RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.bam | bedtools genomecov -ibam stdin > RISE505.GCF_009730055.1_ASM973005v1.mapped.sort.rmdup.genomecov.tsv
We can stratify these statistics further by genomic features (such as genes) provided in a gff
file. Let's download the genomic annotations for the Yersinia pestis CO92 reference genome in gff
format
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.gff.gz
gunzip GCF_000009065.1_ASM906v1_genomic.gff.gz
We compute coverage histograms for all features in the gff
file using
samtools view -q20 -bh RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.bam | bedtools coverage -a GCF_000009065.1_ASM906v1_genomic.gff -b stdin -hist > RISE505.GCF_000009065.1_ASM906v1.mapped.sort.rmdup.genes.coverage.tsv
Now we have all primary results in place for further analysis and visualisation using R
. We activate the R
environment
conda activate r
An R
script with some follow up analyses is provided in the file pathogens.R
. You can open this file in VS Code Studio, and use an interactive R
terminal to work through the commands
The afternoon will be available for following up on missing parts of the group analysis and assignments. Updating your group repositories and making 2-4 slides for the presentation tomorrow. Please make 1-2 slides with your answer to Professor Xavier, that explains to him what you found and your interpretation. Please also reflect on either one finding you found interesting and/or one that you would like to discuss with the rest of us. It can be results but also methodological questions/issues or like.