-
Notifications
You must be signed in to change notification settings - Fork 0
/
QC on NGS reads
64 lines (56 loc) · 3.8 KB
/
QC on NGS reads
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
FASTQC
######
FASTQC for generating the QC report for paired end and single end reads
FastqPuri: high-performance preprocessing of RNA-seq data (2019 superior)
##################
released in 2019 superior over exsisting tools
SolexaQA
########
SolexaQA calculates sequence quality statistics and creates visual representations of data quality for second-generation sequencing data.
Originally developed for the Illumina system (historically known as “Solexa”), SolexaQA now also supports Ion Torrent and 454 data.
Running directly on FASTQ files (now with support for compressed files too), the software has three component algorithms:
Analysis — the primary quality analysis and visualization tool. Designed to run on unmodified FASTQ files obtained directly from Illumina,
Ion Torrent or 454 sequencers.
DynamicTrim — a read trimmer that individually crops each read to its longest contiguous segment for which
quality scores are greater than a user-supplied quality cutoff (or alternately, the read segment returned by the BWA trimming algorithm).
LengthSort — a program to separate high quality reads from low quality reads.
LengthSort assigns trimmed reads to paired-end, singleton and discard files based on a user-defined length cutoff.
Trimmomatic
###########
For trimming the regions which are low in quality, adaptors etc, but not GC content correction
http://www.usadellab.org/cms/?page=trimmomatic
http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
PRINSEQ
######
For GC content correction and specifically for reads from 454/Roche use http://prinseq.sourceforge.net/manual.html#QCGC
BBDUK
#####
used for sbv inprover benchmarking dataset for filtering illumina adapter removal, quality filter for reads
useful links
------------
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
http://seqanswers.com/forums/showthread.php?t=42776
issues --> https://www.biostars.org/p/250149/
adapter removal, quality filter
--> hard to balance R1 and R2 quality filter since, R1 qualified all fastqc filters doesn't mean it does for R2, need to fine tune the
parameters to get R1 & R2 balance.
sample command --> ../bbmap/bbduk.sh in1=ABRF_MGRG_1ng_R1_sample_1L.fastq in2=ABRF_MGRG_1ng_R2_sample_1L.fastq
out1=ABRF_MGRG_1ng_R1_sample_bbduk_trimmed_l20_q30_tpe_tbo out2=ABRF_MGRG_1ng_R2_sample_bbduk_trimmed_l20_q30_tpe_tbo
qtrim=rl trimq=30 ml=20 ref=../bbmap/resources/adapters.fa tpe tbo
**better for single end data
--> instead use trimmomatic which solves the R1 & R2 filetering in the same way and no much fine tuning required
sample command - /drives/c/Program\ Files/Java/jre1.8.0_112/bin/java.exe -jar trimmomatic-0.38.jar PE -phred33 /drives/c
/Users/narsapvi/Downloads/BenchmarkigDatasets/ABRF_MGRG_1ng_R1.fastq.gz /drives/c/Users/narsapvi/Downloads/BenchmarkigDatasets/ABRF_
MGRG_1ng_R2.fastq.gz /drives/c/Users/narsapvi/Downloads/BenchmarkigDatasets/ABRF_MGRG_1ng_R1_PE.fastq.gz /drives/c/Users/narsapvi/Do
wnloads/BenchmarkigDatasets/ABRF_MGRG_1ng_R1_SE.fastq.gz /drives/c/Users/narsapvi/Downloads/BenchmarkigDatasets/ABRF_MGRG_1ng_R2_PE.
fastq.gz /drives/c/Users/narsapvi/Downloads/BenchmarkigDatasets/ABRF_MGRG_1ng_R2_SE.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEAD
ING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
**better for paired end data
SBV Challenge Insights
###############
In case of metagenomic sample GC content:
The GC content distribution of most samples should follow a normal distribution.
In some cases, a bi-modal distribution can be observed, especially for metagenomic data sets.
They have used "MetaSim—A Sequencing Simulator for Genomics and Metagenomic" to generate metagenomic reads
MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios
for planning sequencing projects or for benchmarking metagenomic software.