Day 1
conda activate day1
The sequencing files you will need to map are single-end NGS data from two samples of the same pathogen.
In this part you are given two compressed FASTQ files in the directory course/data/day1
One of these files is ancient, and one is modern. Both contain reads from the pathogen Pseudomonas aeruginosa (reference genome fasta: paeruginosa.fasta.gz)
inspector.fastq.gz
barnaby.fastq.gz
The goal is to ascertain which of the files (Inspector or Barnaby) is the modern sample and which is the ancient one. For this exercise, you can look at the computer exercises from part1 and part2 for inspiration.
- Find the data folder and copy it to your directory
cp -r course/data/day1/* .
- What is in the folder?
ls course/data/day1/
- What are the sizes of the files in the folder?
du -sk course/data/day1/*
- Figure out how many lines and how many reads the FASTQ files contain
zcat file.fastq.gz | wc -l
zcat file.fastq.gz | awk '{s++}END{print s/4}'
- Perform, using fastp, adapter trimming on the provided sample sequence files. Make sure to discard reads under 30 bp (-l parameter).
fastp -i file.fastq.gz -o file.trimmed.fastq -l 30
- What are the output files?
ls .
- How many reads do the trimmed files contain?
cat file.trimmed.fastq | awk '{s++}END{print s/4}'
- What adapter sequences were detected? (look at the fastp output)
- How many reads did not pass the filter, and why? (fastp output)
- What is the average read length of the trimmed files?
cat file.trimmed.fastq | awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}'
a. As a loop:
for i in *.gz
do
echo ${i}
zcat ${i} | awk '{s++}END{print s/4}'
done
- Let us start by indexing the fasta
bwa index paeruginosa.fasta.gz
What files are generated?
- Start to map the data
bwa aln -t 5 paeruginosa.fasta.gz file.trimmed.fastq > file.trimmed.sai
bwa samse paeruginosa.fasta.gz file.trimmed.sai file.trimmed.fastq > file.trimmed.sam
- Convert the sam files to bam
samtools view file.trimmed.sam -b > file.trimmed.bam
- Which files are bigger, the sam files or the bam files?
ls -l .
- Visually inspect the first 10 reads of a bam file
samtools view file.trimmed.bam | head
- Sort the bam files
samtools sort file.trimmed.bam -o file.trimmed.sorted.bam
- How many reads do the bam files contain?
samtools view -c file.trimmed.sorted.bam
- Remove unmapped reads and reads with mapping quality less than 30
samtools view -b -F4 -q 30 file.trimmed.sorted.bam > file.filtered.bam
- How many reads were removed during filtering? (Compare the output of exercise 5.6 with the command below)
samtools view -c file.filtered.bam
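The number of removed reads is simply the difference between the two counts. A minimal sketch with placeholder numbers (substitute the counts that `samtools view -c` reports for your files):

```shell
before=250000   # placeholder: reads in file.trimmed.sorted.bam (samtools view -c)
after=180000    # placeholder: reads in file.filtered.bam (samtools view -c)
echo "removed: $((before - after)) reads"
```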
- Remove duplicate reads
samtools rmdup file.filtered.bam file.filtered.rmdup.bam
- How many reads were duplicates?
samtools flagstat file.filtered.bam
- Get the average length of the remaining reads
samtools view file.filtered.rmdup.bam | awk '{print length($10)}' | datamash mean 1
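If datamash is not installed, the same mean can be computed with awk alone. A self-contained sketch on two toy SAM-like lines standing in for the `samtools view` output (field 10 is the read sequence):

```shell
# Mean length of field 10 across records; toy input stands in for samtools view output
printf 'r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tref\t1\t60\t6M\t*\t0\t0\tACGTAA\tIIIIII\n' \
  | awk '{sum += length($10); n++} END {if (n) print sum/n}'
```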
- Index the bam files
samtools index file.filtered.rmdup.bam
- Run mapDamage
mapDamage -i file.filtered.rmdup.bam -r paeruginosa.fasta.gz --merge-libraries --no-stats
- Look at the plots
conda activate XXXX
Removal of duplicates using vsearch (https://github.com/torognes/vsearch) and filtering of sequences shorter than 30 bp
vsearch --help
After viewing the options, try running it on test.fq
vsearch --fastx_uniques test.fq --fastqout test.vs.fq --minseqlength 30 --strand both
and we gzip the file to compress it
gzip test.vs.fq
zcat test.vs.fq.gz | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > test.vs.fq.gz.read_length.txt
Now view this file in the terminal
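To see what the table looks like, here is a self-contained toy example (three reads, two distinct lengths); in the output, the first column is the count and the second the read length:

```shell
# Toy FASTQ: two 4-base reads and one 6-base read
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTAA\n+\nIIIIII\n' > toy.fq
# Same length-histogram recipe as above, on the toy file
awk '{if(NR%4==2) print length($1)}' toy.fq | sort -n | uniq -c
```

Your real table can be viewed with, e.g., `less test.vs.fq.gz.read_length.txt`.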
Align your quality-checked reads (refer to this repository for suggested quality controls before alignment) against the database. Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) and BWA (https://bio-bwa.sourceforge.net/bwa.shtml) are popular tools for mapping short DNA reads to a reference genome. Bowtie2 is favoured for its speed and memory efficiency, while BWA is known for its versatility and accuracy, with different modes optimized for various read types and use cases. The choice between them depends on the specific requirements of your sequencing project.
There are alternatives to alignment tools, of which the k-mer-based metagenomic classifiers Kraken2, KrakenUniq, and Kaiju are probably the most used examples. While the k-mer matching approach is good for rapidly assigning taxonomic labels to DNA sequences, it can suffer in accuracy and is often limited to taxonomic classification only.
In this tutorial, we will dig into the bowtie2 aligner and how to use it, combined with downstream tools, to do more than just taxonomically classify reads.
Besides the settings used during mapping, the database is perhaps the most important driver of the specificity (accuracy) and sensitivity (in faster modes, reads might not match) of your analysis. There should be a balance between speed, sensitivity, and specificity.
Given that ancient environmental DNA often involves short (50-100 bp) to ultra-short (30-50 bp) DNA molecules, it is recommended to use end-to-end alignments, which are the default setting in bowtie2. We prefer bowtie2 over bwa because bowtie2 offers crucial options, for example allowing a single read to report multiple alignments to different references, which bwa does not. This matters because conserved regions across different species can recruit identical reads, and with default settings both mappers will assign such reads a low mapping quality. These reads would often be filtered out in standard ancient DNA pipelines; however, these conserved regions can still have taxonomic value.
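As a sketch only (the index basename `db`, the reference, and the read file names are placeholders, not files from this course), an end-to-end bowtie2 run reporting up to 100 alignments per read might look like:

```shell
# Build the bowtie2 index once (db is a hypothetical basename)
bowtie2-build reference.fasta db

# End-to-end alignment (the bowtie2 default), reporting up to 100
# alignments per read (-k 100) and skipping unaligned reads (--no-unal)
bowtie2 -x db -U reads.derep.fq -S reads.sam --threads 5 -k 100 --no-unal
```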
In the GitHub classroom, there is an assignment that finalises this day-1 course.
Now, with the knowledge you have obtained in today's exercises, please pair up in your groups and begin:
- draw an outline of a workflow for each step in which you plan to process the 5 samples from Prof. Iriarte.
- for each step, discuss with your group members which parameters to set, and write down your arguments for selecting them.
- next, write a loop that removes duplicated sequences in your files.
- then write a loop that iterates over each dereplicated file and maps its reads to the provided database using --threads 5.
- write all commands in your GitHub repo, with comments explaining what each step does, and push the changes to finalise the assignment.
- if time permits, you could also add the read-length extraction.
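A structural skeleton of such a loop is shown below; the actual vsearch and bowtie2 calls are left as comments for you to fill in, and the two `touch`-ed files are stand-ins for your real samples:

```shell
# Stand-in inputs so the skeleton runs anywhere; replace with your 5 real samples
touch sampleA.fastq.gz sampleB.fastq.gz

for f in *.fastq.gz; do
  # 1) dereplicate, e.g. vsearch --fastx_uniques ... (see the vsearch section above)
  # 2) map the dereplicated reads, e.g. bowtie2 --threads 5 ...
  echo "processing $f"
done
```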