DeepBam: High-Accuracy CpG Methylation Calling with Nanopore Sequencing

Brief Introduction

The recent R10.4 nanopore sequencing system offers improved base-calling accuracy and growing potential for genomic CpG methylation analysis. However, the official Dorado model demonstrates inconsistent performance across datasets. To address this, we developed DeepBam, a robust deep neural network-based CpG methylation calling model.

DeepBam achieves superior accuracy and stability, with an average AUC of 97.80%, balanced accuracy of 95.96%, and F1 score of 94.97% across diverse human and plant datasets. It outperforms Dorado, with methylation frequency correlations against BS-seq exceeding 0.95 in most datasets, and reveals haplotype-specific methylation patterns, even in partially repetitive regions.

Built on a Bi-LSTM architecture, DeepBam uses Python for training and C++ with libtorch for high-performance feature extraction and modification calling, offering unmatched precision and scalability for nanopore-based CpG methylation studies.

Key Features

Efficiently read input data (pod5 and bam) using lib-pod5 and htslib.
Implement efficient feature extraction from large volumes of pod5 and BAM files with a thread pool.
Continuously optimize CPU memory usage and runtime performance.
Perform GPU inference with half-precision to significantly improve model efficiency with minimal impact on accuracy.

Building from Scratch

Building the C++ Program

DeepBam was tested and optimized on an NVIDIA GeForce RTX 3090. Ensure you have a GPU and CUDA Toolkit 11.8 installed. Download libtorch 2.0.1 if it is not already included in your Python environment. The C++ program was compiled with g++ 11.2 on Ubuntu 22.04. Compatibility issues may arise on other systems, so feel free to raise an issue if you encounter any problems.

If you are unfamiliar with installing CUDA Toolkit 11.8, here is an example setup for an Ubuntu 22.04 x86_64 system:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

Install the following packages before building the program:

boost
spdlog: a Fast C++ logging library
zlib

The following projects are already included in 3rdparty/:

argparse: Argument Parser for Modern C++
pod5: C++ abi for nanopore pod5-file-format
cnpy: library to read/write .npy and .npz files in C/C++
ThreadPool: A simple C++11 Thread Pool implementation (slightly modified from the original version in github)

git clone https://github.com/huicongyao/Deep-Bam.git
cd Deep-Bam/cpp
mkdir build && cd build
conda activate DeepBam # Activate the previously created environment
cmake -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` .. # Locate libtorch through the Python environment; if you haven't set up the Python environment, pass the libtorch path directly here.
make -j

DeepBam Usage

After successfully building the program, you can use our pre-trained model or train your own. The executable is located at Deep-Bam/cpp/build/DeepBam.

DeepBam: Extracting High-Confidence Sites

This process extracts features for model training.

Usage: extract_hc_sites [--help] [--version] pod5_dir bam_path reference_path ref_type write_dir pos neg kmer_size num_workers sub_thread_per_worker motif_type loc_in_motif

extract features for model training with high confident bisulfite data

Positional arguments:
  pod5_dir               path to pod5 directory
  bam_path               path to bam file, sorted by file name is needed
  reference_path         path to reference genome
  ref_type               reference genome type [default: "DNA"]
  write_dir              write directory, write file format ${pod5filename}.npz which contains extracted features and its site info
  pos                    positive high accuracy methylation sites
  neg                    negative high accuracy methylation sites
  kmer_size              kmer size for extract features [default: 51]
  num_workers            maximum Pod5 files that process parallelly [default: 10]
  sub_thread_per_worker  num of sub thread per worker, total sub thread equals (sizeof(pod5) + 100M) / 100M * sub_thread_per_worker [default: 4]
  motif_type             motif_type default CG [default: "CG"]
  loc_in_motif           Location in motifset

Optional arguments:
  -h, --help             shows help message and exits
  -v, --version          prints version information and exits
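
The sub-thread formula in the help text above can be worked through with a short Python sketch. It assumes that "100M" means 100,000,000 bytes and that the division is integer division (as it would be in C++ integer arithmetic); neither detail is confirmed by the help text, and the function name total_sub_threads is hypothetical.

```python
def total_sub_threads(pod5_size_bytes, sub_thread_per_worker=4, chunk_bytes=100_000_000):
    """Evaluate (sizeof(pod5) + 100M) / 100M * sub_thread_per_worker.

    Assumptions (not confirmed by the help text): "100M" is 100,000,000
    bytes and "/" is integer division.
    """
    return (pod5_size_bytes + chunk_bytes) // chunk_bytes * sub_thread_per_worker


# A 250 MB pod5 file with the default of 4 sub-threads per worker.
example_threads = total_sub_threads(250_000_000)
print(example_threads)
```

Under these assumptions, every pod5 file gets at least sub_thread_per_worker sub-threads, and the count grows by that amount for each additional 100M of file size.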

The extracted features are saved as npz files containing site information and data. Site info is stored as a tab-delimited string in a uint8 array, and the data array is used for training.
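
As a minimal sketch of how the site info described above might be decoded, the following Python function turns a uint8 buffer back into tab-delimited fields. The function name decode_site_info, the UTF-8 encoding, the newline separation of multiple records, and the example field layout are all assumptions made for illustration; the actual array names and field order are not documented here.

```python
def decode_site_info(raw):
    """Decode a buffer of uint8 values into rows of tab-delimited fields.

    Assumptions: the text is UTF-8/ASCII and, if several records are
    present, they are separated by newlines.
    """
    text = bytes(raw).decode("utf-8")
    return [line.split("\t") for line in text.splitlines() if line]


# Hypothetical site-info content; the real field layout may differ.
example_raw = "chr1\t10468\t+\nchr1\t10470\t-".encode("utf-8")
example_rows = decode_site_info(example_raw)
print(example_rows)
```

With NumPy, an array arr read from one of the npz files via numpy.load could be passed as decode_site_info(arr.tobytes()); inspecting the loaded archive's files attribute first shows which arrays it actually contains.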

The extract_hc_sites mode allows training of customized models on your data. After extraction, run the script py/train_lstm.py to train your model. Refer to the README.md in the py directory for further instructions.

DeepBam: Extract and Call Modifications

This process extracts features and calls modifications with a trained model.

Usage: extract_and_call_mods [--help] [--version] pod5_dir bam_path reference_path ref_type write_file module_path kmer_size num_workers sub_thread_per_worker batch_size motif_type loc_in_motif

asynchronously extract features and pass data to model to get modification result

Positional arguments:
  pod5_dir               path to pod5 directory
  bam_path               path to bam file, sorted by file name is needed
  reference_path         path to reference genome
  ref_type               reference genome type [default: "DNA"]
  write_file             write detailed modification result file path
  module_path            module path to trained model
  kmer_size              kmer size for extract features [default: 51]
  num_workers            maximum Pod5 files that process parallelly [default: 10]
  sub_thread_per_worker  num of sub thread per worker, total sub thread equals (sizeof(pod5) + 100M) / 100M * sub_thread_per_worker [default: 4]
  batch_size             default batch size [default: 1024]
  motif_type             motif_type default CG [default: "CG"]
  loc_in_motif           Location in motifset

Optional arguments:
  -h, --help             shows help message and exits
  -v, --version          prints version information and exits
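
Because every argument is positional, the order on the command line matters. The sketch below assembles the argument list in the order given by the usage line above, using the documented defaults. It assumes the mode name is passed as the first argument to the DeepBam executable; the helper name build_call_mods_command, all file paths, and the loc_in_motif value of 0 are hypothetical placeholders.

```python
def build_call_mods_command(executable, pod5_dir, bam_path, reference_path,
                            write_file, module_path, loc_in_motif,
                            ref_type="DNA", kmer_size=51, num_workers=10,
                            sub_thread_per_worker=4, batch_size=1024,
                            motif_type="CG"):
    """Return an argv list in the positional order shown in the usage line.

    Assumption: the mode name follows the executable as a subcommand.
    """
    args = [
        executable, "extract_and_call_mods",
        pod5_dir, bam_path, reference_path, ref_type,
        write_file, module_path, kmer_size, num_workers,
        sub_thread_per_worker, batch_size, motif_type, loc_in_motif,
    ]
    return [str(a) for a in args]


# All paths and the loc_in_motif value below are placeholders.
example_command = build_call_mods_command(
    "Deep-Bam/cpp/build/DeepBam", "pod5/", "reads.bam", "ref.fa",
    "calls.tsv", "model.pt", loc_in_motif=0,
)
print(" ".join(example_command))
```

The resulting list can be handed to subprocess.run, which avoids shell-quoting problems with paths that contain spaces.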

The extract_and_call_mods process outputs a TSV file containing the following columns:

read_id
reference_start: Start position of the read on the reference genome
reference_end: End position of the read on the reference genome
chromosome: Reference name of the read on the reference genome
pos_in_strand: Position of the current CpG site on the reference genome
strand: Aligned strand of the read on the reference (+/-)
methylation_rate: Methylation rate of the current CpG site as determined by the model.
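
Per-read calls like these are often summarized into a per-site methylation frequency. The following Python sketch shows one way this could be done, assuming the columns appear in the order listed above and that a read counts as methylated when its methylation_rate is at least 0.5. The threshold, the function name site_methylation_frequency, and the sample rows are illustrative assumptions, not part of DeepBam.

```python
import csv
import io
from collections import defaultdict

# Column order as listed above (assumed to match the file layout).
COLUMNS = ["read_id", "reference_start", "reference_end", "chromosome",
           "pos_in_strand", "strand", "methylation_rate"]


def site_methylation_frequency(tsv_lines, threshold=0.5):
    """Return {(chromosome, position, strand): fraction of methylated reads}."""
    counts = defaultdict(lambda: [0, 0])  # site -> [methylated, total]
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        try:
            rate = float(record["methylation_rate"])
        except ValueError:
            continue  # skip a header line, if one is present
        site = (record["chromosome"], int(record["pos_in_strand"]), record["strand"])
        counts[site][1] += 1
        if rate >= threshold:
            counts[site][0] += 1
    return {site: meth / total for site, (meth, total) in counts.items()}


# Made-up rows for illustration only.
sample = io.StringIO(
    "read1\t100\t900\tchr1\t150\t+\t0.93\n"
    "read2\t120\t950\tchr1\t150\t+\t0.12\n"
    "read3\t90\t800\tchr1\t150\t+\t0.71\n"
    "read1\t100\t900\tchr1\t300\t+\t0.05\n"
)
example_freq = site_methylation_frequency(sample)
print(example_freq)
```

For a real output file, pass an open file handle instead of the io.StringIO object.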

Pre-trained TorchScript modules for different k-mer sizes are available under traced_script_module.

Publication

Our work has been published in Briefings in Bioinformatics. If you use this project in your research, please cite:

Xin Bai, Hui-Cong Yao, Bo Wu, Luo-Ran Liu, Yu-Ying Ding, Chuan-Le Xiao, DeepBAM: a high-accuracy single-molecule CpG methylation detection tool for Oxford nanopore sequencing, Briefings in Bioinformatics, Volume 25, Issue 5, September 2024, bbae413, https://doi.org/10.1093/bib/bbae413
