NanoMAP

Quickstart / Installation

Please read about building databases prior to use!

# Obtain NanoMAP
git clone https://github.com/GraceAHall/NanoMAP
cd NanoMAP

# Build database
python build_database.py -d [your_database_path]

# Run
python nanomap.py -r fastq/fasta -d [your_database_path] -p [your_project_name]

Test Datasets

Reads:

ZymoBIOMICS Microbial Community Standard (200 Mb sample)

Database:

ZymoBIOMICS test database (21 strains per species in sample, 172 reference genomes)


Expected output:

Runtime for the ZymoBIOMICS reads and database above: approximately 2-5 minutes using 4 cores.


Further reference genomes for database building:

ZymoBIOMICS reference genomes (needed if using test read set)

RefSeq Complete bacteria, fungi, viruses + latest human reference

RefSeq Complete bacteria, fungi + latest human reference


System Requirements

NanoMAP uses read alignment for sample characterisation.
minimap2 is used for alignment, then NanoMAP processes the output file.

The following are required.

  • minimap2
  • python 3.6 or greater
  • python packages:
    • numpy
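
The requirements above can be checked before a run. The following is a minimal sketch, not part of NanoMAP; the function name `check_requirements` is hypothetical, and it assumes minimap2 is installed under the executable name `minimap2` on your PATH.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Return a dict mapping each requirement to True (met) or False (missing)."""
    return {
        # Assumption: the minimap2 executable is called 'minimap2' and is on PATH.
        "minimap2": shutil.which("minimap2") is not None,
        # Python 3.6 or greater.
        "python>=3.6": sys.version_info >= (3, 6),
        # find_spec checks that numpy is installed without importing it.
        "numpy": importlib.util.find_spec("numpy") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```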

Overview

NanoMAP is an experimental tool for strain-level sample characterisation using long reads (Oxford Nanopore/PacBio).
It uses alignment MAPQ scores to identify sample organisms.

Once a sample has been sequenced, NanoMAP uses this sequence data and a database of reference genomes to identify the organisms present in the sample. Abundance estimates of the identified organisms are given.

Like all characterisation tools, NanoMAP requires reads and a reference database.

Due to its method, redundant copies of a reference genome in the database will degrade performance.
Redundant genomes encountered during runtime can easily be removed by the user (see Removing Redundancies).
Poor-quality reference genomes will similarly degrade performance.

You can read more in the NanoMAP paper


Usage

After a database has been built, NanoMAP can be run with the following:

# general use
python nanomap.py -r fastq/fasta -d database -p projectname

# multithreading 
python nanomap.py -r fastq -d database -p projectname -t 10    

# limit memory usage (in Gigabytes)
python nanomap.py -r fastq -d database -p projectname -m 16 

# specify read technology
python nanomap.py -r fastq -d database -p projectname --map-ont    # Oxford Nanopore reads
python nanomap.py -r fastq -d database -p projectname --map-pb     # PacBio reads

Where:

  • -r specifies the read set (FASTA or FASTQ)
  • -d specifies the built reference database
  • -p specifies the project name for this analysis.

A new project folder - 'projects/projectname' - is created for each sample to store runtime and config files.


Databases

A NanoMAP database is simply a folder containing reference genomes.

  1. Place your reference genome FASTA files into a folder
  2. Build the database with the following:
    python build_database.py -d database    # 'database' = your genome folder path.

The build process will extract some information from FASTA headers, concatenate files into a metagenome, then create a minimap2 index for this metagenome.
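
To make the build steps concrete, here is a rough sketch of the concatenation and indexing steps. It is not the code in build_database.py: the function names, the output filenames and the list of FASTA extensions are assumptions made for illustration. The minimap2 options used (`-x` for a preset, `-d` to write an index) are standard minimap2 options, but which options NanoMAP actually passes is not stated here.

```python
from pathlib import Path

# Assumption: reference genomes use one of these file extensions.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def concatenate_genomes(database_dir, out_path):
    """Concatenate every FASTA file in database_dir into out_path.

    Returns the list of genome files that were used, in sorted order.
    """
    database_dir = Path(database_dir)
    out_path = Path(out_path)
    genomes = sorted(
        p for p in database_dir.iterdir()
        if p.suffix.lower() in FASTA_SUFFIXES
        and p.resolve() != out_path.resolve()  # never include the output itself
    )
    with open(out_path, "wb") as out:
        for genome in genomes:
            data = genome.read_bytes()
            out.write(data)
            # Keep records separate if a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return genomes


def minimap2_index_command(metagenome, index_path, preset="map-ont"):
    """Build (but do not run) a minimap2 command that writes an index file."""
    return ["minimap2", "-x", preset, "-d", str(index_path), str(metagenome)]
```

The command list can be passed to `subprocess.run(cmd, check=True)` once minimap2 is installed.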


Database Requirements

NanoMAP allows flexibility with databases. Any genome assembly can be used, with the following database conditions:

  • FASTA headers contain strain name
  • Each genome is a separate file
  • Genomes are good quality

The FASTA headers should be human readable as these appear in the program output. As an example:

>NC_004337.2 Shigella flexneri 2a str. 301 chromosome, complete genome

Will appear as 'Shigella flexneri 2a str. 301' in the NanoMAP output. The header text before the first space (RefSeq accession in this case) is ignored.
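
The header rule described above can be approximated as follows. This is an illustrative sketch only: NanoMAP's actual parsing may differ, the function name is hypothetical, and the list of trailing descriptors that get stripped is an assumption chosen to reproduce the example.

```python
# Assumption: trailing descriptors like these are not part of the strain name.
TRAILING_DESCRIPTORS = (
    ", complete genome",
    ", complete sequence",
    " chromosome",
    " genome",
)


def strain_name_from_header(header):
    """Derive a human-readable strain name from a FASTA header line."""
    text = header.lstrip(">").strip()
    # Ignore everything before the first space (the RefSeq accession here).
    _, _, description = text.partition(" ")
    name = description.strip()
    # Repeatedly strip known trailing descriptors until none remain.
    stripped = True
    while stripped:
        stripped = False
        for suffix in TRAILING_DESCRIPTORS:
            if name.endswith(suffix):
                name = name[: -len(suffix)].rstrip()
                stripped = True
    return name


header = ">NC_004337.2 Shigella flexneri 2a str. 301 chromosome, complete genome"
print(strain_name_from_header(header))  # Shigella flexneri 2a str. 301
```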

Genomes must be separate files, as filenames are used to uniquely identify each reference genome in the folder.

Poor-quality reference genomes should be avoided, as they can degrade performance.


Database Building

To create a database, make a folder and populate it with FASTA files. Each reference genome needs to be its own file.
A good way to start is to download a batch of complete genome assemblies from NCBI RefSeq. This approach was used during development.

The database must then be built before use:

# general use
python build_database.py -d database

# specify read technology
python build_database.py -d database --map-ont    # Oxford Nanopore
python build_database.py -d database --map-pb     # PacBio

# rebuild database after adding/removing genomes
python build_database.py -d database --rebuild

# build taxonomy information only
python build_database.py -d database --taxonomy-only

# build minimap index only
python build_database.py -d database --index-only

Removing Redundancies

Large, publicly available databases often have redundancies.

Your database folder is allowed to contain these redundancies. That said, some redundant genomes may need to be banned during analysis. This is performed manually by the user.

To ban a genome:

  1. Navigate to the project folder (projects/yourprojectname)
  2. Locate banlist.txt
  3. Add the genome's filename on a new line in banlist.txt

After adding the genome to banlist.txt, re-run the analysis:

python nanomap.py -r fastq/fasta -d database -p projectname --no-initial-alignment

--no-initial-alignment will save time by skipping the initial alignment step and should not influence performance.

This process may be automated in future versions of NanoMAP.
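
Until then, the manual steps can be scripted. The sketch below is not part of NanoMAP; `ban_genome` is a hypothetical helper that appends a genome filename to banlist.txt in the project folder, assuming the layout projects/yourprojectname/banlist.txt described above and one filename per line.

```python
from pathlib import Path


def ban_genome(project_name, genome_filename, projects_root="projects"):
    """Add genome_filename to the project's banlist.txt on a new line.

    Returns True if the filename was added, False if it was already listed.
    Raises FileNotFoundError if the project folder does not exist.
    """
    banlist = Path(projects_root) / project_name / "banlist.txt"
    content = banlist.read_text() if banlist.exists() else ""
    if genome_filename in (line.strip() for line in content.splitlines()):
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(banlist, "a") as handle:
        handle.write(prefix + genome_filename + "\n")
    return True
```

After banning, re-run the analysis with --no-initial-alignment as shown above.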


Output

NanoMAP provides two output files:

  • A brief report
  • A detailed report

These appear as projectname_brief_report.tsv and projectname_detailed_report.tsv.


Brief report

The brief report is the intended output of NanoMAP.
It is a tab-delimited file containing the following information for identified strains:

  • Strain name
  • Filename
  • Sample DNA abundance
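
Because the brief report is tab-delimited, it can be loaded with Python's standard csv module. The sketch below rests on several assumptions that should be checked against a real report: that the columns appear in the order listed above, that the abundance column is a plain number, and that the first row is a header (set `has_header=False` if it is not). The function name is hypothetical.

```python
import csv


def read_brief_report(path, has_header=True):
    """Read a brief report into (strain, filename, abundance) tuples.

    Rows are returned sorted by abundance, highest first.
    """
    records = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            # Assumed column order: strain name, filename, abundance.
            records.append((row[0], row[1], float(row[2])))
    return sorted(records, key=lambda record: record[2], reverse=True)
```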

Detailed report

NanoMAP is still under development.
Incorrect results may sometimes occur, due to low quality reads, low quality reference genomes, and database redundancies.
It is a good idea to inspect the detailed report, or the runtime console output, to catch possible errors.

During runtime, NanoMAP creates a shortlist of candidate strains (strain group) for each true sample strain.
From this shortlist, the true sample strain is identified.

The detailed report captures a snapshot of the information NanoMAP used when identifying strains.
For each shortlist, the following information is recorded:

  • Strain name
  • Filename
  • Strain group
  • Naive abundance within group
  • MAPQ=60 read count
  • MAPQ=10 read count
  • MAPQ=2 read count

Naive abundance is a very rough estimate of abundance within a strain group.

The MAPQ read counts record the number of reads which map uniquely to that strain's reference genome.
MAPQ scores are the basis for identifying strains, and will have informed NanoMAP's decisions.
In our experience, a human can often interpret the console information and projectname_detailed_report.tsv better than NanoMAP.
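
For manual inspection, it can help to view the candidates of each strain group side by side, ordered by their MAPQ read counts. The following is a sketch under assumptions, not NanoMAP code: it assumes the detailed report is tab-delimited with columns in the order listed above, that the three MAPQ columns hold integer counts, and that the first row is a header. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def group_candidates(path, has_header=True):
    """Group detailed-report rows by strain group.

    Returns {group: [(strain, filename, mapq60, mapq10, mapq2), ...]},
    with each group's candidates sorted by read counts, highest first.
    """
    groups = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) < 7:
                continue  # skip blank or malformed lines
            strain, filename, group = row[0], row[1], row[2]
            # Column 3 (naive abundance) is not needed for this view.
            mapq60, mapq10, mapq2 = (int(value) for value in row[4:7])
            groups[group].append((strain, filename, mapq60, mapq10, mapq2))
    for members in groups.values():
        # Order by MAPQ=60 count, then MAPQ=10, then MAPQ=2.
        members.sort(key=lambda member: member[2:], reverse=True)
    return dict(groups)
```

A group in which two candidates have similar counts is a reasonable place to look for database redundancies.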


Licence

This project is covered under the MIT licence. You are free to use, copy, modify or distribute this software.