
ngs_TRIM

Stephen Fisher edited this page Aug 22, 2013 · 10 revisions

Module: TRIM

Trim reads using contaminants listed in the 'contaminants.fa' file.

Usage: ngs.sh trim [-i inputDir] [-c contaminantsFile] [-m minLen] [-se] sampleID
Input:
	/lab/repo/resources/trim/contaminants.fa (file containing contaminants)
	sampleID/inputDir/unaligned_1.fq
	sampleID/inputDir/unaligned_2.fq (paired-end reads)
Output:
	sampleID/trim/unaligned_1.fq
	sampleID/trim/unaligned_2.fq (paired-end reads)
	sampleID/trim/stats.txt
	sampleID/trim/contaminants.fa (contaminants file)
Requires:
	trimReads.py ( https://github.com/safisher/ngs ) 
Options:
	-i inputDir - location of source files (default: orig).
	-c contaminantsFile - file containing contaminants to be trimmed (default: /lab/repo/resources/trim/contaminants.fa).
	-m minLen - minimum length of a trimmed read. A single-end read trimmed below minLen is discarded. For paired-end reads, a read trimmed below minLen is replaced with N's, unless both reads in the pair are shorter than minLen, in which case the pair is discarded (default: 20).
	-se - single-end reads (default: paired-end)

Trimmed data is placed in 'sampleID/trim'. The contaminants file that was used is copied into the trim directory for future reference.

The TRIM module uses the trimReads.py program to perform the trimming. trimReads.py is available from GitHub and is fully documented separately. Because the TRIM module runs trimReads.py itself, the user only needs to know the format of the contaminants file and the order of the trimming steps.


trimReads.py

Trimming will happen in the following order depending on which options are selected:

  1. Remove N's from both ends.
  2. Process contaminants file, removing contaminants based on their order in the contaminants file.
  3. Single-end: discard read if shorter than the minimum length.
  4. Paired-end: if only one of the paired reads is shorter than the minimum length, then replace that read's sequence with N's and replace that read's quality scores with #. If both paired reads are shorter than the minimum length, then discard the read pair.
  5. Pad paired reads with N's so that they are both the same length. For every N that is added, also add a # to the quality score.
  6. Add "L:X" to the read header with X being the new length of the sequence (including any N's that were added to the sequence).
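
Steps 1 and 3 through 6 can be sketched in a few lines of Python. This is a minimal illustration and not the trimReads.py source: the function names `strip_ns` and `finalize_pair`, the (header, sequence, quality) tuple layout, and the space placed before "L:X" in the header are all assumptions made for the example.

```python
def strip_ns(seq, qual):
    """Step 1: remove N's from both ends, keeping the quality string in sync."""
    left = len(seq) - len(seq.lstrip("N"))
    right = len(seq.rstrip("N"))
    return seq[left:right], qual[left:right]


def finalize_pair(read1, read2, min_len=20):
    """Steps 4-6 for a read pair that has already been trimmed.

    Each read is a (header, sequence, quality) tuple (hypothetical layout).
    Returns the updated pair, or None when the pair is discarded.
    """
    reads = [list(read1), list(read2)]
    short = [len(seq) < min_len for _, seq, _ in reads]

    # Step 4: both reads too short -> discard the whole pair.
    if all(short):
        return None

    # Step 4: only one read too short -> mask it with N's and '#'.
    for read, is_short in zip(reads, short):
        if is_short:
            n = len(read[1])
            read[1], read[2] = "N" * n, "#" * n

    # Step 5: pad the shorter read so both reads have the same length.
    target = max(len(read[1]) for read in reads)
    for read in reads:
        pad = target - len(read[1])
        read[1] += "N" * pad
        read[2] += "#" * pad
        # Step 6: record the final length in the header
        # (the separating space is an assumption).
        read[0] += f" L:{target}"

    return tuple(tuple(read) for read in reads)
```

For single-end data (step 3) the equivalent check is simply to drop any read whose trimmed sequence is shorter than `min_len`.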

The contaminants file (fasta-like file) should be formatted as follows:

  1. Sequences must be on a single line (i.e., they cannot contain line breaks).
  2. No blank lines in file.
  3. The sequence header is a space-delimited list of options that begins with a '>' (similar to fasta files). Option names and values should be separated by a colon.
	 Example header: "> name:oligo end:3 size:10 windows:5"
  4. Options:
  • name: this option is required for every sequence
  • method: there are three trimming methods (0 = full contaminant, 1 = mapped contaminant, 2 = identity based). (default 2)
    • 0) Full Contaminant: When a k-mer maps, the entire contaminant is assumed to be present and the read is trimmed accordingly. For example, let's assume we have a k-mer that is located 4 bases from the 5' end of a contaminant. If that k-mer maps, then we shift the trim position by 4 bases toward the 5' end of the read. We then remove all bases from that position to the 3' end of the read, regardless of whether those additional bases map to the contaminant.
    • 1) Mapped Contaminant: When a k-mer maps, we extend the mapping and trim according to the extended mapping. For example, let's assume we have a k-mer that is located 4 bases from the 5' end of a contaminant. If that k-mer maps, then we extend the mapped region one base at a time in the 5' direction until we find a base that doesn't map. We then trim from that position to the 3' end of the read.
    • 2) Identity Based: If a k-mer maps to the read, then the location of the mapping is used to anchor the contaminant to the read. The percent identity and total identity between the contaminant and the read are computed. If both are above their user-defined thresholds, then the read is trimmed from the beginning of the contaminant to the 3' end of the read. If not, the read is not trimmed.
  • size: size of k-mer (default 7)
  • windows: how many k-mers to seek. Cannot be larger than (contaminant length - k-mer size). (default 6)
  • percentIdentity: percent identity threshold for trimming method 2 (0.0 < percentIdentity <= 1.0). (default 0.9)
  • totalIdentity: total identity threshold for trimming method 2. If this is less than the k-mer size then it will have no impact on trimming. (default 16)
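
The three trimming methods differ only in how a k-mer hit is turned into a trim position. The sketch below illustrates that difference under several assumptions that are not taken from trimReads.py: the function names are hypothetical, k-mers are sampled from the first `windows` offsets of the contaminant, percent identity is computed over the overlapping bases only, and thresholds are applied with `>=`.

```python
def find_kmer_hit(read, contam, size=7, windows=6):
    """Return (read_pos, contam_offset) for the first k-mer found, else None.

    Assumption: k-mers are taken from offsets 0..windows-1 of the contaminant.
    """
    for off in range(min(windows, len(contam) - size + 1)):
        pos = read.find(contam[off:off + size])
        if pos != -1:
            return pos, off
    return None


def trim_position(read, contam, read_pos, contam_off,
                  method=2, percent_identity=0.9, total_identity=16):
    """Return the index at which to cut the read, or None for no trimming."""
    # Where base 0 of the contaminant would sit on the read.
    anchor = read_pos - contam_off

    if method == 0:
        # Full contaminant: assume the whole contaminant is present.
        return max(anchor, 0)

    if method == 1:
        # Mapped contaminant: extend toward the 5' end while bases match.
        r, c = read_pos, contam_off
        while r > 0 and c > 0 and read[r - 1] == contam[c - 1]:
            r -= 1
            c -= 1
        return r

    # Identity based: compare the anchored contaminant with the read.
    start = max(anchor, 0)
    contam_start = start - anchor
    overlap = min(len(read) - start, len(contam) - contam_start)
    matches = sum(a == b for a, b in zip(read[start:], contam[contam_start:]))
    if overlap > 0 and matches / overlap >= percent_identity \
            and matches >= total_identity:
        return start
    return None


# The k-mer "CGGAAGA" sits 4 bases from the 5' end of this contaminant
# and maps at position 9 of the read.
contam = "AGATCGGAAGAGC"
read = "CCCCCTTATCGGAAGAGC"
print(trim_position(read, contam, 9, 4, method=0))  # 5
print(trim_position(read, contam, 9, 4, method=1))  # 7
print(trim_position(read, contam, 9, 4, method=2))  # None
```

In this example method 0 cuts 4 bases upstream of the k-mer, method 1 stops at the first mismatching base, and method 2 leaves the read untrimmed because only 11 of the 13 overlapping bases match.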

Here is an example of the contents of a contaminants file:

> name:indexAdapter method:2 percentIdentity:.9 totalIdentity:16 size:10
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
> name:univAdapter method:2 percentIdentity:.9 totalIdentity:16 size:10
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
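
To make the format rules concrete, here is a small reader for a contaminants file, written as a sketch only. `parse_contaminants` and `DEFAULTS` are hypothetical names, not part of trimReads.py; the default values are the ones listed in the options above, and options that are not listed there are kept as plain strings.

```python
# Defaults as documented in the options list above.
DEFAULTS = {"method": 2, "size": 7, "windows": 6,
            "percentIdentity": 0.9, "totalIdentity": 16}
INT_OPTIONS = ("method", "size", "windows", "totalIdentity")


def parse_contaminants(lines):
    """Parse header/sequence line pairs into a list of option dictionaries."""
    contaminants = []
    options = None
    for line in lines:
        line = line.strip()
        if not line:
            raise ValueError("blank lines are not allowed")
        if line.startswith(">"):
            # Header: space-delimited "option:value" fields.
            options = dict(DEFAULTS)
            for field in line[1:].split():
                key, _, value = field.partition(":")
                options[key] = value
            if "name" not in options:
                raise ValueError("every sequence needs a name option")
            for key in INT_OPTIONS:
                options[key] = int(options[key])
            options["percentIdentity"] = float(options["percentIdentity"])
        else:
            # Sequence: exactly one line, directly after its header.
            if options is None:
                raise ValueError("sequence line without a header")
            if options["windows"] > len(line) - options["size"]:
                raise ValueError("windows is too large for this contaminant")
            contaminants.append({**options, "seq": line})
            options = None
    return contaminants


example = [
    "> name:indexAdapter method:2 percentIdentity:.9 totalIdentity:16 size:10",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
]
for entry in parse_contaminants(example):
    print(entry["name"], entry["size"], entry["windows"])
```

Because the example header does not set `windows`, the parsed entry falls back to the documented default of 6.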