Skip to content
Mikkel Winther Pedersen edited this page Oct 1, 2023 · 14 revisions

Sherlock Holmes exercise

If you are using the course virtual webserver

conda activate day1

The sequencing files you will need to map are single-end NGS data two different pathogens.

In this part you are given two compressed FASTQ files in the directory course/data/day1

One of these files is ancient, and one is modern. They are both reads from the pathogen Pseudomonas aeruginosa (reference genome fasta: paeruginosa.fasta.gz)

inspector.fastq.gz
barnaby.fastq.gz 

The goal is to ascertain which of the files (Inspector or Barnaby) belongs to the modern or ancient sample. For this exercise, you can look at the computer exercises from part1 and part2 for inspirations

Exercise 3:

  1. Find the data folder and copy it to your directory
cp -r course/data/day1/* .
  1. What is in the folder?
ls course/data/day1/* .
  1. What are the sizes of the files in the folder?
du -sk course/data/day1/*

Exercise 4:

  1. Figure out how many lines and how many reads the FASTQ files contain
zcat file.fastq.gz | wc -l
zcat file.fastq.gz | awk '{s++}END{print s/4}'
  1. Perform, using fastp, adapter trimming on the provided sample sequence files. Make sure to discard reads under 30 bp (-l parameter).
fastp -i file.fastq.gz -o file.trimmed.fastq -l 30
  1. What are the output files?
ls . 
  1. How many reads do the trimmed files contain?
cat file.trimmed.fastq|awk '{s++}END{print s/4}'
  1. What adapter sequences were detected? (look at the fastp output)

  2. How many reads did not pass the filter, and why? (fastp output)

  3. What is the average read length of the trimmed files?

cat file.trimmed.fastq | awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}

a. as loop

for i in *.gz
do
echo ${i}
zcat ${i} | awk '{s++}END{print s/4}'
done

Exercise 5

  1. Let us start by indexing the fasta
bwa index paeruginosa.fasta.gz

What files are generated?

  1. Start to map the data
bwa aln -t 5 paeruginosa.fasta.gz file.trimmed.fastq > file.trimmed.sai 
bwa samse paeruginosa.fasta.gz file.trimmed.sai file.trimmed.fastq > file.trimmed.sam
  1. Convert the sam files to bam
samtools view file.trimmed.sam -b > file.trimmed.bam
  1. Which files are bigger, the sam files or the bam files?
ls -l .
  1. Visually inspect the first 10 reads of a bam file
samtools view file.trimmed.bam | head
  1. Sort the bam files
samtools sort file.trimmed.bam -o file.trimmed.sorted.bam
  1. How many reads do the bam files contain?
samtools view -c file.trimmed.sorted.bam

Exercise 6

  1. Remove unmapped reads and reads with mapping quality less than 30
samtools view -b -F4 -q 30 file.trimmed.sorted.bam > file.filtered.bam
  1. How many reads were removed during filtering? (Compare output of exercise 5.6 and below command)
samtools view -c file.filtered.bam
  1. Remove duplicate reads
samtools rmdup file.filtered.bam file.filtered.rmdup.bam
  1. How many reads were duplicates?
samtools flagstat file.filtered.bam 
  1. Get the average length of the remaining reads
samtools view file.filtered.rmdup.bam | awk '{print length($10)}' | datamash mean 1 

Exercise 7

  1. Index the bam files
samtools index file.filtered.rmdup.bam
  1. Run mapDamage
mapDamage -i file.filtered.rmdup.bam -r paeruginosa.fasta.gz --merge-libraries --no-stats
  1. Look at the plots

Exercise 8

Start by activating your conda environment

conda activate XXXX

remove of duplicates using vsearch (https://github.com/torognes/vsearch) and filter sequences shorter than 30Bp

vsearch --help

After viewing the options try and run the test.fq

vsearch --fastx_uniques test.fq --fastqout test.vs.fq --minseqlength 30 --strand both

and we zip the file to compress the size

gunzip test.vs.fq

Read length distribution of the raw data how does it look., push to classroom

cat test.vs.fq.gz | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > test.vs.fq.gz.read_length.txt

Now view this file in the terminal

Mapping of reads to a custom database

Align your quality-checked reads (refer to this repository for suggested quality controls before alignment) against the database, an example: Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) and BWA (https://bio-bwa.sourceforge.net/bwa.shtml) are popular alignment tools used in bioinformatics to map short DNA sequences (reads) to a reference genome. Both Bowtie2 and BWA are alignment tools used to map short DNA reads to a reference genome. Bowtie2 is favoured for its speed and memory efficiency, while BWA is known for its versatility and accuracy, with different modes optimized for various read types and use cases. The choice between them depends on the specific requirements of your sequencing project.

There are alternatives to the alignment tools, where the k-mer-based metagenomic classification tools: Kraken2, Kraken-Uniq, and Kaiju, are probably the most used examples. While the k-mer matching approach is good for rapidly assigning taxonomic labels to DNA sequences. it suffers from accuracy and is often limited to taxonomic classification only.

In this tutorial, we will dig into the bowtie2 aligner and how to use it combined with downstream tools to not only taxonomically classify reads

The database is perhaps one of the most important driving factors in specificity (accuracy) and sensitivity (in faster modes reads might not match) of your analysis, besides the settings during the mapping. there is a balance between speed, sensitivity and specificity.

Group exercise

In the GitHub classroom, there is an exercise that finalises this day-1 course.

Group assignment "mapping your data"

Now with the knowledge you have obtained in today's exercises please pair up in your groups and begin by

  1. draw an outline of a workflow for each step in which you plan to process the 5 samples from Prof. Iriarte.
  2. for each step discuss with your group members which parameters to set and write down the arguments for selecting these.
  3. next make first a loop that removes duplicated sequences in your files
  4. then make a loop that one by one will loop over each dereplicated file and map each read to the provided database using --threads 5.
  5. write all commands in your GitHub repo, with comments explaining what each step does. And push the changes to finalise the assignment.
  6. if time permits you could also add the read length extraction