Skip to content
ANGSD edited this page Sep 23, 2024 · 21 revisions

Small data exercise

Exercise 1 and 2 are found in slides.

The main conclusion from day one exercise is the large group exercise at the bottom of this page (starting at Exercise 8). This, however, takes too long to run in a classroom setting. Therefore, all exercises prior to the group exercise are based on two very small (simulated) test files that mimic the characteristics of modern and ancient DNA.

Solution or example commands can be found here: https://docs.google.com/document/d/1OcXQVS-ZC_86BIoVgEWlEC0Tgnwu1gW7kKrADj7Y2Mo/edit?usp=sharing

If you are using the course virtual webserver

conda activate env1

The sequencing files you will need to map are single-end NGS data two different pathogens.

In this part you are given two compressed FASTQ files in the directory course/data/day1

One of these files is ancient, and one is modern. They are both reads from the pathogen Pseudomonas aeruginosa (reference genome fasta: paeruginosa.fasta.gz)

inspector.fastq.gz
barnaby.fastq.gz 

The goal is to ascertain which of the files (Inspector or Barnaby) belongs to the modern or ancient sample. For this exercise, you can look at the computer exercises from part1 and part2 for inspirations

Exercise 3:

  1. Find the data folder and copy it to your directory
cp -r ~/course/data/day1/* .
  1. What is in the folder?
ls ~/course/data/day1/* .
  1. What are the sizes of the files in the folder?
du -sk ~/course/data/day1/*
  1. Validate that you have copied the data to your folder
pwd
ls -l

Exercise 4:

Note that for the following exercises the commands are examples, you will need to change the file names to suit your FASTQs, and run each command twice (one for each file).

  1. Figure out how many lines and how many reads the FASTQ files contain
zcat file.fastq.gz | wc -l
zcat file.fastq.gz | awk '{s++}END{print s/4}'
  1. Perform, using fastp, adapter trimming on the provided sample sequence files. Make sure to discard reads under 30 bp (-l parameter).
fastp -i file.fastq.gz -o file.trimmed.fastq -l 30
  1. What are the output files?
ls . 
  1. How many reads do the trimmed files contain?
cat file.trimmed.fastq|awk '{s++}END{print s/4}'
  1. What adapter sequences were detected? (look at the fastp output)

  2. How many reads did not pass the filter, and why? (fastp output)

  3. What is the average read length of the trimmed files?

cat file.trimmed.fastq | awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}'

a. as loop

for i in *.gz
do
echo ${i}
zcat ${i} | awk '{s++}END{print s/4}'
done

Exercise 5

  1. Let us start by indexing the fasta
bowtie2-build reference.fasta.gz bw2index

What files are generated?

  1. Start to map the data
bowtie2 -x bw2index -U file.trimmed.gz -S output.sam
##or step1 and step2 if we are using bwa
bwa index reference.fasta.gz
bwa aln -t 5 reference.fasta.gz file.trimmed.fastq > file.trimmed.sai 
bwa samse reference.fasta.gz file.trimmed.sai file.trimmed.fastq > file.trimmed.sam
  1. Convert the sam files to bam
samtools view file.trimmed.sam -b > file.trimmed.bam
  1. Which files are bigger, the sam files or the bam files?
ls -l .
  1. Visually inspect the first 10 reads of a bam file
samtools view file.trimmed.bam | head
  1. Sort the bam files
samtools sort file.trimmed.bam -o file.trimmed.sorted.bam
  1. How many reads do the bam files contain?
samtools view -c file.trimmed.sorted.bam

Exercise 6

  1. How many unmapped reads are there in both bams?
samtools flagstat file.trimmed.sorted.bam
  1. Remove unmapped reads and reads with mapping quality less than 30
samtools view -b -F4 -q 30 file.trimmed.sorted.bam > file.filtered.bam
  1. How many reads were removed during filtering? (Compare output of exercise 6.1 and below command)
samtools view -c file.filtered.bam
  1. Remove duplicate reads
samtools rmdup -s file.filtered.bam file.filtered.rmdup.bam
  1. How many reads were duplicates?
samtools view -c file.filtered.rmdup.bam 
  1. Get the average length of the remaining reads
samtools view file.filtered.rmdup.bam | awk '{print length($10)}' | datamash mean 1 

Exercise 7

  1. Index the bam files
samtools index file.filtered.rmdup.bam
  1. Run mapDamage
mapDamage -i file.filtered.rmdup.bam -r paeruginosa.fasta.gz --no-stats
  1. Look at the plots

Exercise 8

Start by activating your conda environment

conda activate env2

Let's check what tools are installed in this environment

conda list

Do you have the tools you need for the next step? e.g. vsearch? if not then you checkout which other environments exists.

conda env list

Once you have the correct environment, go to your working directory (wdir) and make a new folder called mapping

cd ~/course/wdir/
mkdir mapping
cd mapping

Make a symbolic link to one of the fastq files

ln -s ~/course/data/day1/fastq/PRI-TJPGK-CATN-26-28.fq.gz .
ls -l

Let's check how the file looks to verify that it is a fastq file and that the file in fact have been trimmed.

zcat PRI-TJPGK-CATN-26-28.fq.gz | head -40

This prints the first 40 lines of the file to stdout.

remove of duplicates using vsearch and filter sequences shorter than 30Bp

vsearch --help

After viewing the options try and run the test.fq

vsearch --fastx_uniques PRI-TJPGK-CATN-26-28.fq.gz --fastqout PRI-TJPGK-CATN-26-28.vs.fq --minseqlength 30 --strand both

and we zip the file to compress the size. It is always good practice to compress your files when possible.

gzip PRI-TJPGK-CATN-26-28.vs.fq

We can now check that the read count has changed in the files by simply counting the lines in each file

zcat PRI-TJPGK-CATN-26-28.vs.fq.gz | wc -l
zcat PRI-TJPGK-CATN-26-28.fq.gz | wc -l

Read length distribution of the raw data how does it look.

zcat PRI-TJPGK-CATN-26-28.vs.fq.gz | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > PRI-TJPGK-CATN-26-28.vs.fq.gz.read_length.txt

Now view this file in the terminal using more or cat or a text editor.

Mapping of reads to a custom database

Align your quality-checked reads (refer to this repository for suggested quality controls before alignment) against the database, an example: Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) and BWA (https://bio-bwa.sourceforge.net/bwa.shtml) are popular alignment tools used in bioinformatics to map short DNA sequences (reads) to a reference genome. Both Bowtie2 and BWA are alignment tools used to map short DNA reads to a reference genome. Bowtie2 is favoured for its speed and memory efficiency, while BWA is known for its versatility and accuracy, with different modes optimized for various read types and use cases. The choice between them depends on the specific requirements of your sequencing project.

There are alternatives to the alignment tools, where the k-mer-based metagenomic classification tools: Kraken2, Kraken-Uniq, and Kaiju, are probably the most used examples. While the k-mer matching approach is good for rapidly assigning taxonomic labels to DNA sequences. it suffers from accuracy and is often limited to taxonomic classification only.

In this tutorial, we will dig into the bowtie2 aligner and how to use it combined with downstream tools to not only taxonomically classify reads

The database is perhaps one of the most important driving factors in specificity (accuracy) and sensitivity (in faster modes reads might not match) of your analysis, besides the settings during the mapping. However, there should be a balance between speed, sensitivity and specificity.

Given that ancient environmental DNA often involves short (100-50 Bp) to ultra-short (50-30Bp) DNA molecules, it is recommended to use end-to-end alignments, which are the default setting in the bowtie2 options. Why we have preferred bowtie2 over bwa, is explained by the fact that bowtie2 offers crucial options that for example allow us to parse multiple alignments to different references by one read which bwa does not. This is important as conserved regions across different species can recruit identical reads and hence by default settings in both mappers will get a low mapping quality. Now this would often be filtered in normal ancient DNA pipelines, however, these conserved regions can still have taxonomical value.

The database is perhaps one of the most important driving factors in specificity (accuracy) and sensitivity (in faster modes reads might not match) of your analysis, besides the settings during the mapping. there is a balance between speed, sensitivity and specificity.

First, activate the environment that has bowtie2 and samtools

conda activate env1

Now explore the bowtie2 manual pages, you can quickly see the shortcuts for all options, default settings and

bowtie2 --help

By default, bowtie2 reports the alignments in uncompressed sam format to the stdout (printed in the terminal window).

bowtie2 --threads 5 -k 100 -x ~/course/data/day1/db/aegenomics.db3 -U PRI-TJPGK-CATN-26-28.vs.fq.gz --no-unal

One can therefore pipe the output into samtools view specifying that it should be converted/saved as a compressed .sam file e.g. .bam. In the example below we are running the end-to-end alignment with --sensitive (-D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default))

bowtie2 --threads 5 -k 100 -x ~/course/data/day1/db/aegenomics.db3 -U PRI-TJPGK-CATN-26-28.vs.fq.gz --no-unal | samtools view -bS - > PRI-TJPGK-CATN-26-28.bam

It should now print something like this to stdout

3906012 reads; of these: 3906012 (100.00%) were unpaired; of these: 11682 (0.30%) aligned 0 times 1138334 (29.14%) aligned exactly 1 time 2755996 (70.56%) aligned >1 times 99.70% overall alignment rate

If so, then you are ready to find your group members and continue with the group assignment. Please also, but please start with reflecting what these statistics mean and about the overall alignment rate. Hint, would it be natural to have almost 100% of the reads aligned from a metagenome?

Group assignment

In the GitHub classroom, there is an assignment that finalises this day-1 course.

Group assignment "mapping your data"

Now with the knowledge you have obtained in today's exercises please pair up in your groups and begin by

  1. draw an outline of a workflow for each step in which you plan to process the 6 samples from Prof. Brown. 1a. make a symbolic link to the remaining files using the command below
ln -s ~/course/data/shared/fastq/*fq.gz ~/course/wdir/mapping/
  1. for each step discuss with your group members which parameters to set and write down the arguments for selecting these.
  2. next make first a loop that removes duplicated sequences in your files
  3. then make a loop that one by one will loop over each dereplicated file and map each read to the provided database using --threads 5.
  4. write all commands in your GitHub repo by accepting the classroom assignment, with comments explaining what each step does. And push the changes to finalise the assignment.
  5. if time permits you could also add the read length extraction and even a plot perhaps?