GitHub - at-cg/RAFT

Introduction

Removal of contained reads has long been a weakness of overlap-layout-consensus (OLC) assemblers. RAFT (Repeat Aware Fragmentation Tool) is an algorithm designed to improve assembly quality by rescuing contained reads. RAFT breaks long reads into shorter sub-reads using the algorithm described in our preprint. This fragmentation allows an OLC assembler to retain contained reads during string graph construction. When input reads have non-uniform lengths, retaining contained reads improves assembly contiguity and base-level accuracy. RAFT takes two inputs: an error-corrected read file in FASTA/FASTQ format and an all-vs-all alignment file in PAF format. It fragments the reads and writes the fragmented reads in FASTA format.

We recommend using hifiasm for the initial steps (read error correction, all-vs-all overlap computation) and also for the final step (assembly of fragmented reads). The assembly output format of hifiasm is described in the hifiasm documentation.

The RAFT-hifiasm workflow is recommended for long accurate reads with non-uniform length distribution (e.g., ONT Duplex, accurate ultralong ONT Simplex). ONT UL reads can optionally be integrated during the final assembly step.

Try RAFT-hifiasm Workflow on Small Test Data

The entire test workflow below will take about 3-4 minutes. Users can either run the commands one by one or copy the commands into an executable script.

# Install RAFT 
git clone https://github.com/at-cg/RAFT.git
cd RAFT && make && cd ..

# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make -j4 && cd ..

mkdir -p assembly && cd assembly/

# Get small test data
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz

# Estimate coverage -- extract the total number of bases (sum_len)
# and divide by estimated length of genome (2M)
seqkit stats chr11-2M.fa.gz
# coverage = 84,042,542 / 2,000,000 ≈ 42
COVERAGE=42

# First run of hifiasm with 4 threads to obtain error corrected reads
../hifiasm/hifiasm -o errorcorrect -t4 --write-ec chr11-2M.fa.gz 2> errorcorrect.log

# Second run of hifiasm to obtain all-vs-all read overlaps as a paf file
../hifiasm/hifiasm -o getOverlaps -t4 --dbg-ovec errorcorrect.ec.fa 2> getOverlaps.log
# Merge cis and trans overlaps
cat getOverlaps.0.ovlp.paf getOverlaps.1.ovlp.paf > overlaps.paf

# RAFT fragments the error corrected reads
../RAFT/raft -e ${COVERAGE} -o fragmented errorcorrect.ec.fa overlaps.paf

# Final hifiasm run to obtain assembly of fragmented reads
# A single round of error correction (-r1) is enough here
../hifiasm/hifiasm -o finalasm -t4 -r1 fragmented.reads.fasta 2> finalasm.log
ls finalasm*p_ctg.gfa
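
The coverage estimate in the workflow above can also be computed rather than read off by hand. The following is a minimal sketch, assuming FASTA input (FASTQ is not handled) and that `gzip` and `awk` are available; `estimate_coverage` is a hypothetical helper and not part of RAFT, hifiasm, or seqkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RAFT): estimate read coverage from a FASTA file.
# Usage: estimate_coverage <reads.fa or reads.fa.gz> <genome_size_in_bp>
estimate_coverage() {
    local reads="$1" genome_size="$2"
    # gzip -cdf decompresses gzip input and passes uncompressed input through unchanged
    gzip -cdf "$reads" |
        awk -v g="$genome_size" '
            !/^>/ { total += length($0) }             # sum bases on sequence lines only
            END   { printf "%d\n", total / g + 0.5 }  # round to the nearest integer
        '
}
```

For the test data, `COVERAGE=$(estimate_coverage chr11-2M.fa.gz 2000000)` should give the same value as the manual calculation, provided the total base count matches the `sum_len` reported by seqkit.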

For large inputs, we recommend increasing the thread count according to the number of cores available. The RAFT-hifiasm workflow takes about 9 hours and ~100 GB of RAM using 128 threads on a multicore Perlmutter CPU-based node to process 32x ONT Duplex human data.
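
One way to make the thread count adjustable is to wrap the test workflow in a shell function. This is a sketch only: the names `THREADS`, `HIFIASM`, `RAFT`, and `run_raft_hifiasm` are illustrative assumptions, and the fallback of 4 threads is arbitrary. The commands and flags themselves are the ones used in the test workflow.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the commands from the test workflow in a function.
# Variable and function names are illustrative, not part of RAFT or hifiasm.

# Use all available cores unless THREADS is already set; fall back to 4.
THREADS="${THREADS:-$(nproc 2>/dev/null || echo 4)}"
HIFIASM="${HIFIASM:-../hifiasm/hifiasm}"
RAFT="${RAFT:-../RAFT/raft}"

# Usage: run_raft_hifiasm <reads.fa.gz> <estimated_coverage>
run_raft_hifiasm() {
    local reads="$1" coverage="$2"

    # Step 1: obtain error-corrected reads
    "$HIFIASM" -o errorcorrect -t"$THREADS" --write-ec "$reads" 2> errorcorrect.log || return 1

    # Step 2: compute all-vs-all overlaps, then merge cis and trans overlaps
    "$HIFIASM" -o getOverlaps -t"$THREADS" --dbg-ovec errorcorrect.ec.fa 2> getOverlaps.log || return 1
    cat getOverlaps.0.ovlp.paf getOverlaps.1.ovlp.paf > overlaps.paf || return 1

    # Step 3: fragment the error-corrected reads
    "$RAFT" -e "$coverage" -o fragmented errorcorrect.ec.fa overlaps.paf || return 1

    # Step 4: assemble the fragmented reads with a single round of error correction
    "$HIFIASM" -o finalasm -t"$THREADS" -r1 fragmented.reads.fasta 2> finalasm.log
}
```

Setting `THREADS` in the environment before sourcing the script overrides the detected core count.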

Usage Details

raft [options] <input-reads.fa> <in.paf>

The following options customize the behavior of the program. Default values, shown in square brackets, are used when an option is not specified.

-r INT: Set the resolution of local coverage [50]
-e INT: Set the estimated coverage
-m NUM: Set the coverage multiplier for high coverage [1.5]
-l INT: Set the desired read length [20000]
-p INT: Set the minimum repeat length to be preserved [5000]
-f INT: Set the flanking length for repeats [1000]
-v INT: Set the overlap length between fragmented reads [500]
-o STR: Set the prefix of output files ["raft"]

Installation

Clone the repository:

git clone https://github.com/at-cg/RAFT.git

Compile the source code:

cd RAFT
make

Examples

Run RAFT with an estimated coverage of 20 and a coverage multiplier of 1.3:

raft -e 20 -m 1.3 -o output <input_reads> <input_overlaps>

Run RAFT with custom parameters:

raft -e 20 -m 1.3 -p 7000 -f 500 -v 500 -l 15000 -o output <input_reads> <input_overlaps>
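
Before running RAFT on a large dataset, it can be useful to check that the overlap file is not empty and covers the expected number of reads. The sketch below relies only on the PAF convention that the first tab-separated column holds the query sequence name; `paf_summary` is a hypothetical helper and not part of RAFT.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RAFT): summarize a PAF overlap file.
# Usage: paf_summary <overlaps.paf>
paf_summary() {
    awk -F'\t' '
        { n++; seen[$1] = 1 }          # column 1 of PAF is the query sequence name
        END {
            for (q in seen) reads++    # count distinct query reads
            printf "overlaps=%d query_reads=%d\n", n, reads
        }
    ' "$1"
}
```

Running `paf_summary overlaps.paf` prints the number of overlap records and the number of distinct query reads on one line.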

Output Files

RAFT outputs the following files:

Fragmented reads in `prefix`.reads.fasta
Coverage information for each read in `prefix`.coverage.txt
For input reads simulated using seqrequester, RAFT outputs additional information for debugging:
1. the positions of long repeats in reference contigs in `prefix`.long_repeats.bed
2. the positions of long repeats in reads in `prefix`.long_repeats.txt

Preprint

Sudhanva Shyam Kamath, Mehak Bindra, Debnath Pal, Chirag Jain. Telomere-to-telomere assembly by preserving contained reads. bioRxiv (November 2023).
