The empirical error models that ship with Polyester were created using the GemSIM framework1. GemSIM ships with error models estimated from real datasets sequenced on three different platforms: Illumina Genome Analyzer IIx with Illumina Sequencing Kit v4 chemistry (referred to as 'Illumina v4'), Illumina Genome Analyzer IIx with TrueSeq SBS Kit v5-GA (referred to as 'Illumina v5'), and Roche/454 FLX Titanium (referred to as 'Roche 454').
From these error models, we estimated error probabilities for each nucleotide in a read based on the read position and on the reference nucleotide at that base. Separate error probabilities were estimated for each of 4 specific sequencing errors (i.e., wrongly sequencing the base as each of the three incorrect nucleotides, or 'N').
To create the error models that ship with Polyester, we used the code in build_error_models.py
. This code depends on numpy. I ran this in Python 2.7.5 on OSX 10.9.
python build_error_models.py /path/to/GemSIM_v1.6/models /path/to/output
We then saved these models as R objects in data/
(so they can be used directly with Polyester) using the following R code:
modfolder = '/path/to/output'
platforms = c('ill100v4_mate1', 'ill100v4_mate2',
'ill100v4_single', 'ill100v5_mate1', 'ill100v5_mate2',
'ill100v5_single', 'r454ti_single')
for(platform in platforms){
i = which(platforms == platform)
model = read.table(paste(modfolder, platform, sep='/'), header=TRUE)
names(model)[1] = 'refbase'
model$refbase = substr(model$refbase, 4, 4)
eval(parse(text=paste0('model', i, ' <- model')))
save(list=paste0('model',i), file=paste0('data/', platform, '.rda'),
compress='xz')
}
McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.