Skip to content
View nanobaselib's full-sized avatar

Block or report nanobaselib

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Please don't include any personal information such as legal names or email addresses. Maximum 100 characters, markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
nanobaselib/README.md

Website | Paper | Poster | Tutorials

Demo Dataset

The processed demo dataset (1.1GB) can be downloaded from Zenodo.

NanoBaseLib/
    ├── base_calling/
    ├── dataprep/
    ├──  ... 
    ├── demo_dataset/
    │    ├── 0_reference/
    │    │        └── ref.fa
    │    ├── 1_raw_signal/
    │    │        ├── single_fast5/
    │    │        ├── multi_fast5/
    │    │        └── multi_pod5/
    │    ├── 2_base_called/
    │    │        ├── dorado/
    │    │        │     └── dorado.bam
    │    │        ├── guppy/
    │    │        │     ├── fail/
    │    │        │     ├── pass/
    │    │        │     └── workspace/
    │    │        ├── demo_dataset.fastq
    │    │        └── reads-ref.sorted.filter.bam
    │    ├── 3_tailfindr/
    │    │        └──  tails.csv
    │    ├── 4_nanopolish/
    │    │        ├── eventalign.txt
    │    │        ├── eventalign_combined.txt
    │    │        ├── polya.tsv
    │    │        └── polya-pass-only-with-head.tsv
    │    ├── 5_tombo/
    │    │        └── single_reads/
    │    │                ├── tombo_resquiggle.txt
    │    │                └── tombo_summary.txt
    │    └── 6_segpore/
    └── demo_dataset_base_calling/
        ├──  fast5/
        ├──  input/
        ├──  label/
        └──  output/  

Preprocessing Pipeline

git clone https://github.com/nanobaselib/NanoBaseLib.git
cd NanoBaseLib

Download the clean demo dataset (319.1 MB), unzip it, and place it in the NanoBaseLib folder. Your directory structure should look like the example above.

Step 1: Data Standardization

If the format of your data is single-fast5, convert it into multi-fast5. Not need for demo dataset.

cd demo_dataset
single_to_multi_fast5 --input_path 1_raw_signal/single_fast5 \
--save_path 1_raw_signal/multi_fast5 \
--filename_base demo --batch_size 4000 --recursive

If Dorado base caller is needed, convert multi-fast5 to pod5.

pod5 convert fast5 1_raw_signal/multi_fast5/*.fast5  \
--output 1_raw_signal/multi_pod5 \
--one-to-one 1_raw_signal/multi_fast5

Step 2: Base Calling

guppy_basecaller -c rna_r9.4.1_70bps_hac.cfg --num_callers 20 --cpu_threads_per_caller 20 \
                 -i 1_raw_signal/multi_fast5 -s 2_base_called/guppy  --fast5_out
dorado basecaller rna002_70bps_hac@v3 1_raw_signal/multi_pod5 \
                  --estimate-poly-a > 2_base_called/dorado/dorado.bam 

Step 3: Mapping

cd 2_base_called
find guppy -type f -name "*.fastq" -exec cat {} + > demo_dataset.fastq
nanopolish index --directory=../1_raw_signal/multi_fast5 demo_dataset.fastq
minimap2 -ax map-ont --MD -t 8 --secondary=no ../0_reference/ref.fa demo_dataset.fastq | samtools sort -o reads-ref.sorted.bam -T reads.tmp

image

samtools view -b -F 2324 reads-ref.sorted.bam > reads-ref.sorted.filter.bam
samtools index reads-ref.sorted.filter.bam
samtools quickcheck reads-ref.sorted.filter.bam
samtools view -h -o reads-ref.sorted.filter.sam reads-ref.sorted.filter.bam

Step 4: PolyA Detection

nanopolish polya --threads=32 --reads=demo_dataset.fastq --bam=reads-ref.sorted.filter.bam \
                 --genome=../0_reference/ref.fa > ../4_nanopolish/polya.tsv
grep -E 'PASS|readname' ../4_nanopolish/polya.tsv > ../4_nanopolish/polya-pass-only-with-head.tsv
library(tailfindr)
df <- find_tails(fast5_dir = '2_base_called/guppy/workspace',
                 save_dir = '3_tailfindr',
                 csv_filename = 'tails.csv',
                 num_cores = 20)

Step 5: Segmentation and Event Alignmnet

nanopolish eventalign --reads demo_dataset.fastq \
--bam reads-ref.sorted.filter.bam \
--genome ../0_reference/ref.fa \
--signal-index \
--scale-events \
--summary ../4_nanopolish/summary.txt \
--threads 32 > ../4_nanopolish/eventalign.txt
cd ..
mkdir 5_tombo/single_reads
for i in `ls 2_base_called/guppy/workspace`; do multi_to_single_fast5 --input_path 2_base_called/guppy/workspace/${i}  --save_path 5_tombo/single_reads --recursive -t 20; done
tombo resquiggle 5_tombo/single_reads 0_reference/ref.fa --overwrite --processes 20

image

NanoBaseLib Commands

combine_nanopolish_eventalign

# cd NanoBaseLib
python dataprep/combine_nanopolish_eventalign.py \
--input_file demo_dataset/4_nanopolish/eventalign.txt  \
--base_type rna --n_processes 10

extract_tombo_resquiggle

python dataprep/extract_tombo_resquiggle.py --root_dir demo_dataset/5_tombo/single_reads --folder 0

generate_rna_dataset

mkdir demo_dataset_base_calling
python base_calling/generate_rna_dataset.py \
--input_file demo_dataset/4_nanopolish/eventalign.txt \
--summary_file demo_dataset/4_nanopolish/summary.txt \
--reference_file demo_dataset/0_reference/ref.fa \
--fast5_folder demo_dataset/1_raw_signal/multi_fast5 \
--save_folder demo_dataset_base_calling

image

base_calling_dataloader

python base_calling/base_calling_dataloader.py --root_dir demo_dataset_base_calling

You can use the Dataloader to develop a new model for base calling and evaluate its performance using the base_calling_evaluation.py script.

Popular repositories Loading

  1. NanoBaseLib NanoBaseLib Public

    [NeurIPS 2024 D&B] A Multi-task Benchmark Dataset for Nanopore Sequencing.

    Python 4

  2. nanobaselib.github.io nanobaselib.github.io Public

    CSS