Day 1
conda activate day1
The sequencing files you will need to map are single-end NGS data from two samples of the same pathogen.
In this part you are given two compressed FASTQ files in the directory course/data/day1
One of these files is ancient, and one is modern. Both contain reads from the pathogen Pseudomonas aeruginosa (reference genome fasta: paeruginosa.fasta.gz)
inspector.fastq.gz
barnaby.fastq.gz
The goal is to ascertain which of the files (Inspector or Barnaby) is the modern sample and which is the ancient one. For this exercise, you can look at the computer exercises from part1 and part2 for inspiration.
- Find the data folder and copy it to your directory
cp -r course/data/day1/* .
- What is in the folder?
ls course/data/day1/
- What are the sizes of the files in the folder?
du -sk course/data/day1/*
- Figure out how many lines and how many reads the FASTQ files contain
zcat file.fastq.gz | wc -l
zcat file.fastq.gz | awk '{s++}END{print s/4}'
- Perform, using fastp, adapter trimming on the provided sample sequence files. Make sure to discard reads under 30 bp (-l parameter).
fastp -i file.fastq.gz -o file.trimmed.fastq -l 30
- What are the output files?
ls .
- How many reads do the trimmed files contain?
cat file.trimmed.fastq | awk '{s++}END{print s/4}'
- What adapter sequences were detected? (look at the fastp output)
- How many reads did not pass the filter, and why? (fastp output)
- What is the average read length of the trimmed files?
cat file.trimmed.fastq | awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}'
a. As a loop:
for i in *.gz
do
echo ${i}
zcat ${i} | awk '{s++}END{print s/4}'
done
- Let us start by indexing the fasta
bwa index paeruginosa.fasta.gz
What files are generated?
- Start to map the data
bwa aln -t 5 paeruginosa.fasta.gz file.trimmed.fastq > file.trimmed.sai
bwa samse paeruginosa.fasta.gz file.trimmed.sai file.trimmed.fastq > file.trimmed.sam
- Convert the sam files to bam
samtools view file.trimmed.sam -b > file.trimmed.bam
- Which files are bigger, the sam files or the bam files?
ls -l .
- Visually inspect the first 10 reads of a bam file
samtools view file.trimmed.bam | head
- Sort the bam files
samtools sort file.trimmed.bam -o file.trimmed.sorted.bam
- How many reads do the bam files contain?
samtools view -c file.trimmed.sorted.bam
- Remove unmapped reads and reads with mapping quality less than 30
samtools view -b -F4 -q 30 file.trimmed.sorted.bam > file.filtered.bam
- How many reads were removed during filtering? (Compare the output of exercise 5.6 with the command below)
samtools view -c file.filtered.bam
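The number of removed reads is simply the difference between the two counts. A minimal sketch with placeholder numbers (substitute the counts that `samtools view -c` reports for your files):

```shell
before=250000   # placeholder: reads in file.trimmed.sorted.bam (samtools view -c)
after=180000    # placeholder: reads in file.filtered.bam (samtools view -c)
echo "removed: $((before - after)) reads"
```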
- Remove duplicate reads
samtools rmdup file.filtered.bam file.filtered.rmdup.bam
- How many reads were duplicates?
samtools flagstat file.filtered.bam
- Get the average length of the remaining reads
samtools view file.filtered.rmdup.bam | awk '{print length($10)}' | datamash mean 1
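If datamash is not installed, the same mean can be computed with awk alone. A self-contained sketch on two toy SAM-like lines standing in for the `samtools view` output (field 10 is the read sequence):

```shell
# Mean length of field 10 across records; toy input stands in for samtools view output
printf 'r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tref\t1\t60\t6M\t*\t0\t0\tACGTAA\tIIIIII\n' \
  | awk '{sum += length($10); n++} END {if (n) print sum/n}'
```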
- Index the bam files
samtools index file.filtered.rmdup.bam
- Run mapDamage
mapDamage -i file.filtered.rmdup.bam -r paeruginosa.fasta.gz --merge-libraries --no-stats
- Look at the plots
conda activate XXXX
Removal of duplicates using vsearch (https://github.com/torognes/vsearch) and filtering of sequences shorter than 30 bp
vsearch --help
After viewing the options, try running it on test.fq
vsearch --fastx_uniques test.fq --fastqout test.vs.fq --minseqlength 30 --strand both
and we gzip the file to compress it
gzip test.vs.fq
zcat test.vs.fq.gz | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > test.vs.fq.gz.read_length.txt
Now view this file in the terminal
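To see what the table looks like, here is a self-contained toy example (three reads, two distinct lengths); in the output, the first column is the count and the second the read length:

```shell
# Toy FASTQ: two 4-base reads and one 6-base read
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTAA\n+\nIIIIII\n' > toy.fq
# Same length-histogram recipe as above, on the toy file
awk '{if(NR%4==2) print length($1)}' toy.fq | sort -n | uniq -c
```

Your real table can be viewed with, e.g., `less test.vs.fq.gz.read_length.txt`.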
Align your quality-checked reads (refer to this repository for suggested quality controls before alignment) against the database. Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) and BWA (https://bio-bwa.sourceforge.net/bwa.shtml) are popular tools for mapping short DNA reads to a reference genome. Bowtie2 is favoured for its speed and memory efficiency, while BWA is known for its versatility and accuracy, with different modes optimized for various read types and use cases. The choice between them depends on the specific requirements of your sequencing project.
There are alternatives to alignment tools, of which the k-mer-based metagenomic classifiers Kraken2, KrakenUniq, and Kaiju are probably the most used examples. While the k-mer matching approach is good for rapidly assigning taxonomic labels to DNA sequences, it can suffer in accuracy and is often limited to taxonomic classification only.
In this tutorial, we will dig into the bowtie2 aligner and how to use it, combined with downstream tools, to do more than just taxonomically classify reads.
Besides the settings used during mapping, the database is perhaps the most important driver of the specificity (accuracy) and sensitivity (in faster modes, reads might not match) of your analysis. There should be a balance between speed, sensitivity, and specificity.
Given that ancient environmental DNA often involves short (50-100 bp) to ultra-short (30-50 bp) DNA molecules, it is recommended to use end-to-end alignments, which are the default setting in bowtie2. We prefer bowtie2 over bwa because bowtie2 offers crucial options, for example allowing a single read to report multiple alignments to different references, which bwa does not. This matters because conserved regions across different species can recruit identical reads, and with default settings both mappers will assign such reads a low mapping quality. These reads would often be filtered out in standard ancient DNA pipelines; however, these conserved regions can still have taxonomic value.
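As a sketch only (the index basename `db`, the reference, and the read file names are placeholders, not files from this course), an end-to-end bowtie2 run reporting up to 100 alignments per read might look like:

```shell
# Build the bowtie2 index once (db is a hypothetical basename)
bowtie2-build reference.fasta db

# End-to-end alignment (the bowtie2 default), reporting up to 100
# alignments per read (-k 100) and skipping unaligned reads (--no-unal)
bowtie2 -x db -U reads.derep.fq -S reads.sam --threads 5 -k 100 --no-unal
```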
In the GitHub classroom, there is an assignment that finalises this day-1 course.
Now, with the knowledge you have obtained in today's exercises, please pair up in your groups and begin:
- draw an outline of a workflow for each step in which you plan to process the 5 samples from Prof. Iriarte.
- for each step, discuss with your group members which parameters to set, and write down your arguments for selecting them.
- next, write a loop that removes duplicated sequences in your files.
- then write a loop that iterates over each dereplicated file and maps its reads to the provided database using --threads 5.
- write all commands in your GitHub repo, with comments explaining what each step does, and push the changes to finalise the assignment.
- if time permits, you could also add the read-length extraction.
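A structural skeleton of such a loop is shown below; the actual vsearch and bowtie2 calls are left as comments for you to fill in, and the two `touch`-ed files are stand-ins for your real samples:

```shell
# Stand-in inputs so the skeleton runs anywhere; replace with your 5 real samples
touch sampleA.fastq.gz sampleB.fastq.gz

for f in *.fastq.gz; do
  # 1) dereplicate, e.g. vsearch --fastx_uniques ... (see the vsearch section above)
  # 2) map the dereplicated reads, e.g. bowtie2 --threads 5 ...
  echo "processing $f"
done
```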