The Precise Basecalling of Short-Read Nanopore Sequencing

Introduction

Nucleotide modifications distort nanopore sequencing readouts and thereby introduce artifacts when the sequence backbone is basecalled. Here, we present an iterative approach that polishes basecalling results disturbed by modifications. We show that this approach improves basecalling accuracy for both artificially synthesized and real-world molecules. Having demonstrated its efficacy and reliability, we use the approach to precisely basecall therapeutic RNAs carrying artificial or natural modifications. This serves as the basis for quantifying the purity and integrity of vaccine mRNAs transcribed in vitro, and for determining modification hotspots in novel therapeutic RNA interference (RNAi) molecules bioengineered in vivo (BioRNA).

Major Contribution: 3-step sampling

Our study shows that compromised basecalling can be improved through an iterative workflow. To enhance polishing at the 3’ and 5’ ends, which is crucial for short reads, we developed a 3-step sampling strategy. Reads are sampled from the 5’ end, the full molecule, and the 3’ end, ensuring even coverage and better basecalling at both termini.
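The idea of drawing three samples per read can be pictured with a toy sketch. This is not the toolkit's implementation: it assumes that "3-step sampling" means taking one segment anchored at the 5’ end, one spanning the full molecule, and one anchored at the 3’ end. It works on a base string as a stand-in for raw signal, and the function name `three_step_sample` and the window size are hypothetical.

```shell
# Toy illustration only: the toolkit samples raw signal, not base strings,
# and its actual window sizes are not stated in this README.
# three_step_sample SEQ WINDOW prints three tab-separated rows:
#   5prime  first WINDOW characters of SEQ
#   full    the whole SEQ
#   3prime  last WINDOW characters of SEQ
three_step_sample() {
    local seq="$1"
    local window="$2"
    local len="${#seq}"

    # Reject empty input and non-positive windows.
    if [ "$len" -eq 0 ] || [ "$window" -lt 1 ]; then
        echo "usage: three_step_sample <sequence> <window >= 1>" >&2
        return 2
    fi
    # Clamp the window so very short reads still yield three samples.
    if [ "$window" -gt "$len" ]; then
        window="$len"
    fi

    printf '5prime\t%s\n' "$(printf '%s' "$seq" | cut -c "1-$window")"
    printf 'full\t%s\n' "$seq"
    printf '3prime\t%s\n' "$(printf '%s' "$seq" | cut -c "$((len - window + 1))-")"
}
```

Under these assumptions, every read contributes to both termini as well as to the body, which is one way the ends of short reads could receive the same attention as the middle.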

Installation

Prerequisites

singularity
git
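Before cloning, it may help to confirm that both tools listed above are on your `PATH`. The helper below is a minimal sketch; the function name `missing_tools` is hypothetical and not part of the toolkit.

```shell
# Print each command that cannot be found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    local rc=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call for the two tools listed above:
# missing_tools singularity git || echo "install the tools printed above first"
```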

Step 1: git clone

git clone https://github.com/wangziyuan66/iterative-labeling-toolkit-bonito

Step 2: build sif file

singularity build bonito.sif bonito.recipe

This creates a standalone environment for running iterative-labeling-bonito.

Usage

# raw=$1
# reference=$2
# bonito=$3
# basecall=$4
bash scripts/sr_iterative_labelling.sh raw reference ./ ./scripts/sr_basecall.py

raw: Path to the folder containing the raw pod5 files.
reference: Path to the reference genome.
bonito: Path to the bonito singularity image.
basecall: Path to the "basecall.py" file (the example above passes ./scripts/sr_basecall.py).
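Since the script takes four positional arguments, a quick pre-flight check can catch a wrong path before a long run starts. The sketch below is hypothetical and not part of the toolkit: the function name `check_inputs` is invented, and it assumes that raw files end in `.pod5`. It checks only that the bonito path exists, because the example above passes `./` for that argument.

```shell
# Hypothetical pre-flight check; not part of the toolkit.
# Returns 0 when all four arguments look usable, non-zero otherwise.
check_inputs() {
    if [ "$#" -ne 4 ]; then
        echo "usage: check_inputs <raw> <reference> <bonito> <basecall>" >&2
        return 2
    fi
    local raw="$1"
    local reference="$2"
    local bonito="$3"
    local basecall="$4"
    local rc=0

    if [ ! -d "$raw" ]; then
        echo "raw is not a directory: $raw" >&2
        rc=1
    # Assumption: raw signal files carry the .pod5 extension.
    elif [ -z "$(find "$raw" -type f -name '*.pod5' | head -n 1)" ]; then
        echo "no .pod5 files found under: $raw" >&2
        rc=1
    fi
    [ -f "$reference" ] || { echo "reference not found: $reference" >&2; rc=1; }
    # Only existence is checked here, not whether the path is a file.
    [ -e "$bonito" ] || { echo "bonito path not found: $bonito" >&2; rc=1; }
    [ -f "$basecall" ] || { echo "basecall script not found: $basecall" >&2; rc=1; }
    return "$rc"
}

# Example: run the pipeline only if the check passes.
# check_inputs raw reference ./ ./scripts/sr_basecall.py &&
#     bash scripts/sr_iterative_labelling.sh raw reference ./ ./scripts/sr_basecall.py
```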

Miscellaneous

If the sequencing kit is RNA002, we recommend using [iterative-labeling-toolkit-taiyaki](https://github.com/wangziyuan66/iterative-labeling-toolkit-taiyaki) instead. The first round of basecalling currently uses the RNA004 hac 5.0.0 model.

Data availability

Sample raw pod5 files are provided for BioRNA-Leu, BioRNA-Ser, ChemoRNA-Leu, and ChemoRNA-Ser, which you can download here. Only BAM files can be uploaded to SRA, so if you need more raw data, please contact us.

Contact

Ziyuan Wang princezwang@arizona.edu
