LightAssembler is a lightweight, resource-efficient assembly algorithm for high-throughput sequencing reads. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of g-spaced sequenced kmers, and the other holds kmers classified as likely correct by a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to other competing tools. More details about LightAssembler can be found in:
El-Metwally, S., Zakaria, M. and Hamza, T.; LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32 (21): 3215-3223. doi: 10.1093/bioinformatics/btw470.
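To make the sampling idea above concrete, here is a minimal Python sketch. It is not LightAssembler's C++ implementation. It assumes that "g-spaced" means taking kmers whose start positions are g apart, and it uses a plain toy Bloom filter (`ToyBloomFilter` and `g_spaced_kmers` are hypothetical names) rather than a cache-oblivious one. The statistical test that fills the second filter is not reproduced here.

```python
import hashlib


class ToyBloomFilter:
    """Plain Bloom filter for illustration only (not cache-oblivious)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def g_spaced_kmers(read, k, g):
    """Return kmers starting at positions 0, g, 2g, ... (assumed meaning of g-spaced)."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, g)]
```

A larger g stores fewer kmers in the first filter, which is consistent with the example logs further down, where g = 15 yields far fewer BloomA kmers than g = 4.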
Copyright (C) 2015-2016 Sara El-Metwally, Magdi Zakaria and Taher Hamza. Released under the GNU GPL.
A 64-bit machine with the g++ compiler (or gcc in general) and the pthreads and zlib libraries.
- Clone the GitHub repo, e.g. with
git clone https://github.com/SaraEl-Metwally/LightAssembler.git
- Run
make
in the repo directory for k <= 31, or
make k=kmersize
for k > 31, e.g.
make k=49
./LightAssembler -k [kmer size] -g [gap size] -e [error rate] -G [genome size] -t [threads] -o [output prefix] [input files] --verbose
* [-k] kmer size [default: 31]
* [-g] gap size [default: 25X:3 35X:4 75X:8 140X:15 280X:25]
* [-e] error rate [default: 0.01]
* [-G] genome size [default: 0]
* [-t] number of threads [default: 1]
* [-o] output prefix file name [default: LightAssembler]
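When LightAssembler is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of LightAssembler: `build_command` is a hypothetical helper, the defaults are copied from the option list, and the binary path `./LightAssembler` is assumed to be the build output in the repo directory. Leaving `gap` as `None` omits `-g` entirely.

```python
MAX_INPUT_FILES = 100  # limit stated for this version of LightAssembler


def build_command(input_files, kmer_size=31, gap=None, error_rate=0.01,
                  genome_size=0, threads=1, prefix="LightAssembler",
                  verbose=False, binary="./LightAssembler"):
    """Build the LightAssembler argument list from the documented options."""
    input_files = list(input_files)
    if not input_files:
        raise ValueError("at least one input file is required")
    if len(input_files) > MAX_INPUT_FILES:
        raise ValueError(f"at most {MAX_INPUT_FILES} input files are supported")

    cmd = [binary, "-k", str(kmer_size)]
    if gap is not None:
        # Only pass -g when the caller chose a gap size explicitly.
        cmd += ["-g", str(gap)]
    cmd += ["-e", str(error_rate), "-G", str(genome_size),
            "-t", str(threads), "-o", prefix]
    cmd += input_files
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the Python standard library.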
- If the gap size parameter is missing, LightAssembler invokes its parameter extrapolation module to compute the starting gap size from the sequencing coverage and the error rate of the dataset.
- The maximum read length for this version is 1024 bp.
- The maximum number of read files supported by this version is 100.
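The default list for `-g` can be read as coverage:gap pairs (for example, 35X coverage maps to a gap of 4). The sketch below only illustrates that reading. The nearest-tier lookup is an assumption made for the example and is not the actual extrapolation module, which also takes the error rate into account. `DEFAULT_GAPS` and `nearest_default_gap` are hypothetical names.

```python
# Coverage -> gap pairs copied from the documented -g default list.
DEFAULT_GAPS = {25: 3, 35: 4, 75: 8, 140: 15, 280: 25}


def nearest_default_gap(coverage):
    """Return the gap of the documented coverage tier closest to `coverage`.

    Assumption: nearest-tier lookup is for illustration only; the real
    module computes the starting gap from coverage and error rate.
    """
    tier = min(DEFAULT_GAPS, key=lambda c: abs(c - coverage))
    return DEFAULT_GAPS[tier]
```

For a dataset with an average coverage of 35, this lookup returns 4.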
LightAssembler assembles multiple input files of sequencing reads given in fasta/fastq format. It can also read gzip-compressed input files (fasta.gz/fastq.gz) directly.
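For checking input files before an assembly run, the same four file types can be read in a few lines of Python. This is a minimal sketch, independent of LightAssembler's own reader. It assumes four-line FASTQ records and picks the format from the first character of the file. `open_reads` and `read_sequences` are hypothetical helper names.

```python
import gzip


def open_reads(path):
    """Open a plain or gzip-compressed read file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_sequences(path):
    """Yield sequences from a fasta or fastq file (optionally gzipped)."""
    with open_reads(path) as handle:
        first = handle.readline()
        if first.startswith("@"):
            # fastq: header, sequence, '+', qualities (four lines per record)
            while first.startswith("@"):
                seq = handle.readline().strip()
                handle.readline()  # '+' separator line
                handle.readline()  # quality line
                yield seq
                first = handle.readline()
        elif first.startswith(">"):
            # fasta: a sequence may span several lines
            chunks = []
            for line in handle:
                if line.startswith(">"):
                    yield "".join(chunks)
                    chunks = []
                else:
                    chunks.append(line.strip())
            yield "".join(chunks)
```

A quick use is `max(len(s) for s in read_sequences(path))`, to confirm that no read exceeds the maximum read length stated for this version.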
The output of LightAssembler is the set of assembled contigs in fasta format, in the file:
[output prefix].contigs.fasta
LightAssembler also reports the following on the screen:
- Number of resulting contigs.
- Maximum contig length.
- Total assembly size.
- Total genome coverage.
- Total assembly time as well as the total time for each step.
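The first four figures can be recomputed from the contigs file. In the example runs in this README, the reported genome coverage matches the assembly size divided by the `-G` genome size (4473869 / 4686137 ≈ 95.4703%), so the sketch below assumes that definition. `contig_stats` is a hypothetical helper, not part of LightAssembler.

```python
def contig_stats(fasta_path, genome_size):
    """Summarise a contigs fasta file.

    Assumption: genome coverage = 100 * assembly size / genome size,
    which matches the example logs in this README.
    """
    lengths = []
    current = 0
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            else:
                current += len(line.strip())
    if seen_header:
        lengths.append(current)

    assembly_size = sum(lengths)
    coverage = 100.0 * assembly_size / genome_size if genome_size else None
    return {
        "number of contigs": len(lengths),
        "maximum contig length": max(lengths) if lengths else 0,
        "assembly size": assembly_size,
        "genome coverage": coverage,
    }
```

With the default genome size of 0, the sketch returns `None` for the coverage instead of dividing by zero.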
With the --verbose option, LightAssembler also reports additional details for each step, such as the number of kmers, the false positive rate of the Bloom filters, the number of branching kmers in the dataset, the average read length and the average sequencing coverage.
./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose
--- Uniform kmers sampling.
--- h(0):m(0):s(5) elapsed time.
--- total number of kmers in BloomA = 7791111
--- BloomA false positive rate = 0.00193375
--- average read length = 101
--- average sequencing coverage = 35
--- probability of an incorrect kmer appears in the sample : 0.0249524
--- Trusted/untrusted kmers filtering.
--- h(0):m(0):s(24) elapsed time.
--- total number of kmers in BloomB = 4548112
--- BloomB false positive rate = 7.7715e-05
--- Branching-kmers computation.
--- h(0):m(0):s(5) elapsed time.
--- number of branching kmers = 54644
--- Graph traversal.
--- h(0):m(0):s(16) elapsed time.
--- number of contigs = 731
--- maximum contig length = 120924
--- assembly size = 4473869
--- genome coverage = 95.4703%
--- The assembly session is finished.
--- h(0):m(0):s(31) elapsed time.
./LightAssembler -k 31 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose
--- Parameters extrapolation.
--- h(0):m(0):s(1) elapsed time.
--- start with gap size g = 4
--- average read length = 101
--- average sequencing coverage = 35
--- Uniform kmers sampling.
--- h(0):m(0):s(8) elapsed time.
--- total number of kmers in BloomA = 27604568
--- BloomA false positive rate = 0.0375047
--- probability of an incorrect kmer appears in the sample : 0.118144
--- Trusted/untrusted kmers filtering.
--- h(0):m(0):s(9) elapsed time.
--- total number of kmers in BloomB = 4655530
--- BloomB false positive rate = 9.1219e-05
--- Branching-kmers computation.
--- h(0):m(0):s(2) elapsed time.
--- number of branching kmers = 57242
--- Graph traversal.
--- h(0):m(0):s(22) elapsed time.
--- number of contigs = 747
--- maximum contig length = 127975
--- assembly size = 4474072
--- genome coverage = 95.4746%
--- The assembly session is finished.
--- h(0):m(0):s(42) elapsed time.
./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq
--- Uniform kmers sampling.
--- h(0):m(0):s(2) elapsed time.
--- Trusted/untrusted kmers filtering.
--- h(0):m(0):s(11) elapsed time.
--- Branching-kmers computation.
--- h(0):m(0):s(1) elapsed time.
--- Graph traversal.
--- h(0):m(0):s(17) elapsed time.
--- number of contigs = 731
--- maximum contig length = 120924
--- assembly size = 4473869
--- genome coverage = 95.4703%
--- The assembly session is finished.
--- h(0):m(0):s(31) elapsed time.
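To compare runs such as the ones above, the reported values can be collected from the screen output. This is a minimal sketch that assumes the output was captured to a string and that value lines look like the examples in this README (`--- name = value` or `--- name : value`). `parse_log` is a hypothetical helper; step titles and elapsed-time lines are skipped.

```python
import re

# Matches "--- <name> = <value>" and "--- <name> : <value>" lines.
_VALUE_LINE = re.compile(r"^---\s*(?P<key>[^=:]+?)\s*[=:]\s*(?P<value>\S+)\s*$")


def parse_log(text):
    """Return a dict of numeric values reported in LightAssembler output."""
    stats = {}
    for line in text.splitlines():
        match = _VALUE_LINE.match(line.strip())
        if not match:
            continue  # step titles and elapsed-time lines carry no value
        raw = match.group("value").rstrip("%")
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                continue  # not a number, ignore the line
        stats[match.group("key")] = value
    return stats
```

For instance, the `number of contigs` and `maximum contig length` entries of two parsed logs can be compared directly to see the effect of a different gap size.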