Please read about building databases prior to use!
# Obtain NanoMAP
git clone https://github.com/GraceAHall/NanoMAP
cd NanoMAP
# Build database
python build_database.py -d [your_database_path]
# Run
python nanomap.py -r fastq/fasta -d [your_database_path] -p [your_project_name]
Reads:
ZymoBIOMICS Microbial Community Standard (200 Mb sample)
Database:
ZymoBIOMICS test database (21 strains per species in sample, 172 reference genomes)
Expected output:
Runtime for the ZymoBIOMICS reads and database above: approximately 2-5 mins using 4 cores.
Further reference genomes for database building:
ZymoBIOMICS reference genomes (needed if using test read set)
RefSeq Complete bacteria, fungi, viruses + latest human reference
RefSeq Complete bacteria, fungi + latest human reference
NanoMAP uses read alignment for sample characterisation.
minimap2 is used for alignment, then NanoMAP processes the output file.
The following are required.
- minimap2
- python 3.6 or greater
- python packages:
- numpy
NanoMAP is an experimental tool for strain-level sample characterisation using long reads (Oxford Nanopore/PacBio).
It uses alignment MAPQ scores to identify sample organisms.
Once a sample has been sequenced, NanoMAP uses this sequence data and a database of reference genomes to identify the organisms present in the sample. Abundance estimates of the identified organisms are given.
Like all characterisation tools, NanoMAP requires reads and a reference database.
Due to its method, redundant copies of a reference genome in the database will degrade performance.
Redundant genomes can be easily removed by the user if encountered during runtime. See Removing Redundancies.
Poor-quality reference genomes will similarly degrade performance.
You can read more in the NanoMAP paper
After a database has been built, NanoMAP can be run with the following:
# general use
python nanomap.py -r fastq/fasta -d database -p projectname
# multithreading
python nanomap.py -r fastq -d database -p projectname -t 10
# limit memory usage (in Gigabytes)
python nanomap.py -r fastq -d database -p projectname -m 16
# specify read technology
python nanomap.py -r fastq -d database -p projectname --map-ont    # nanopore; use --map-pb for PacBio
Where:
- -r specifies the read set (FASTA or FASTQ)
- -d specifies the built reference database
- -p specifies the project name for this analysis.
A new project folder - 'projects/projectname' - is created for each sample to store runtime and config files.
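If NanoMAP is run over many samples, it can help to assemble the command line in a script. The following is a minimal sketch, not part of NanoMAP: the function name `nanomap_command` and its parameter names are hypothetical, and only the flags listed above (`-r`, `-d`, `-p`, `-t`, `-m`, `--map-ont`, `--map-pb`) are used.

```python
from typing import List, Optional


def nanomap_command(reads: str, database: str, project: str,
                    threads: Optional[int] = None,
                    memory_gb: Optional[int] = None,
                    technology: Optional[str] = None) -> List[str]:
    """Build an argument list for nanomap.py (hypothetical helper)."""
    cmd = ["python", "nanomap.py", "-r", reads, "-d", database, "-p", project]
    if threads is not None:
        cmd += ["-t", str(threads)]      # multithreading
    if memory_gb is not None:
        cmd += ["-m", str(memory_gb)]    # memory limit in gigabytes
    if technology == "ont":
        cmd.append("--map-ont")          # Oxford Nanopore reads
    elif technology == "pb":
        cmd.append("--map-pb")           # PacBio reads
    elif technology is not None:
        raise ValueError("technology must be 'ont', 'pb' or None")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the NanoMAP folder.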
A NanoMAP database is simply a folder containing reference genomes.
- Place your reference genome FASTA files into a folder
- Build the database with the following:
python build_database.py -d database # 'database' = your genome folder path.
The build process will extract some information from FASTA headers, concatenate files into a metagenome, then create a minimap2 index for this metagenome.
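To make the concatenate-then-index idea concrete, here is a rough sketch. It is not NanoMAP's build code: the helper names, the accepted file extensions, and the output filenames are all assumptions. The only external call it prepares is `minimap2 -x <preset> -d <index> <fasta>`, which is minimap2's documented way of writing an index.

```python
from pathlib import Path
from typing import List

# Assumed FASTA extensions; NanoMAP may accept others.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def concatenate_genomes(folder: str, out_path: str) -> List[str]:
    """Concatenate every FASTA file in `folder` into `out_path`.

    Returns the filenames that were merged, in sorted order.
    """
    out = Path(out_path)
    files = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() in FASTA_SUFFIXES
        and p.resolve() != out.resolve()   # never merge the output into itself
    )
    with open(out, "wb") as merged:
        for path in files:
            data = path.read_bytes()
            merged.write(data)
            if data and not data.endswith(b"\n"):
                merged.write(b"\n")        # keep the next header on its own line
    return [p.name for p in files]


def minimap2_index_command(metagenome: str, index: str,
                           preset: str = "map-ont") -> List[str]:
    """Argument list that asks minimap2 to write an index for `metagenome`."""
    return ["minimap2", "-x", preset, "-d", index, metagenome]
```

In practice `build_database.py` does this work for you; the sketch only illustrates what the build step produces.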
NanoMAP allows flexibility with databases. Any genome assembly can be used, with the following database conditions:
- FASTA headers contain strain name
- Each genome is a separate file
- Genomes are good quality
The FASTA headers should be human readable as these appear in the program output. As an example:
>NC_004337.2 Shigella flexneri 2a str. 301 chromosome, complete genome
Will appear as 'Shigella flexneri 2a str. 301' in the NanoMAP output. The header text before the first space (RefSeq accession in this case) is ignored.
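To check in advance how a header might read in the output, something like the sketch below can be used. It is only an approximation: NanoMAP's own parsing rules are not reproduced here, and the list of trailing descriptors (`chromosome`, `plasmid`, `genome`, or a comma) is an assumption chosen to match the example above.

```python
import re

# Assumed markers for where the strain name ends; not NanoMAP's actual rule.
_NAME_END = re.compile(r"\s+(chromosome|plasmid|genome)\b|,")


def strain_name(header: str) -> str:
    """Approximate the strain name shown for a FASTA header (hypothetical helper)."""
    text = header.lstrip(">").strip()
    # The text before the first space (e.g. a RefSeq accession) is ignored.
    parts = text.split(" ", 1)
    description = parts[1] if len(parts) > 1 else ""
    match = _NAME_END.search(description)
    if match:
        description = description[:match.start()]
    return description.strip()


print(strain_name(">NC_004337.2 Shigella flexneri 2a str. 301 chromosome, complete genome"))
# prints: Shigella flexneri 2a str. 301
```

A header with nothing after the accession yields an empty string, which is a hint that the file needs a more descriptive header before it is added to a database.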
Genomes must be separate files, as filenames are used to uniquely identify each reference genome in the folder.
Poor-quality reference genomes should be avoided, as they can degrade performance.
To create a database, make a folder and populate it with FASTA files. Each reference genome needs to be its own file.
A good way to start is to download a batch of complete genome assemblies from NCBI RefSeq. This approach was used during development.
The database must then be built before use with one of the following:
# general use
python build_database.py -d database
# specify read technology
python build_database.py -d database --map-ont    # nanopore; use --map-pb for PacBio
# rebuild database after adding/removing genomes
python build_database.py -d database --rebuild
# build taxonomy information only
python build_database.py -d database --taxonomy-only
# build minimap index only
python build_database.py -d database --index-only
Large, publicly available databases often have redundancies.
Your database folder is allowed to contain these redundancies. That said, some redundant genomes may need to be banned during analysis. This is performed manually by the user.
To ban a genome:
- Navigate to the project folder (projects/yourprojectname)
- Locate banlist.txt
- Add the genome's filename on a new line in banlist.txt
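The steps above can also be scripted. The sketch below is a hypothetical helper, not part of NanoMAP; it assumes banlist.txt holds one genome filename per line, as described above, and avoids adding the same filename twice.

```python
from pathlib import Path


def ban_genome(project_dir: str, genome_filename: str) -> bool:
    """Append `genome_filename` to banlist.txt in the project folder.

    Returns True if the filename was added, False if it was already listed.
    """
    banlist = Path(project_dir) / "banlist.txt"
    text = banlist.read_text() if banlist.exists() else ""
    if genome_filename in (line.strip() for line in text.splitlines()):
        return False
    with open(banlist, "a") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")             # start on a new line
        handle.write(genome_filename + "\n")
    return True
```

For example, `ban_genome("projects/projectname", "some_genome.fasta")`, where the genome filename is a placeholder for a file in your database folder.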
After adding the genome to banlist.txt, re-run the analysis:
python nanomap.py -r fastq/fasta -d database -p projectname --no-initial-alignment
--no-initial-alignment will save time by skipping the initial alignment step and should not influence performance.
This process may be automated in future versions of NanoMAP.
NanoMAP provides two output files:
- A brief report
- A detailed report
These appear as projectname_brief_report.tsv and projectname_detailed_report.tsv.
Brief report
The brief report is the intended output of NanoMAP.
It is a tab-delimited file containing the following information for identified strains:
- Strain name
- Filename
- Sample DNA abundance
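The brief report can be loaded with Python's standard `csv` module. This sketch assumes the three columns appear in the order listed above and that the abundance column is numeric (optionally with a trailing `%`); neither is confirmed here, so check a real report first. Rows whose third column is not a number, such as a header row if one exists, are skipped.

```python
import csv
from typing import List, Tuple


def read_brief_report(path: str) -> List[Tuple[str, str, float]]:
    """Read (strain name, filename, abundance) rows, highest abundance first."""
    rows = []
    with open(path, newline="") as handle:
        for record in csv.reader(handle, delimiter="\t"):
            if len(record) < 3:
                continue                   # blank or malformed line
            name, filename, abundance = record[0], record[1], record[2]
            try:
                value = float(abundance.strip().rstrip("%"))
            except ValueError:
                continue                   # e.g. a header row
            rows.append((name, filename, value))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```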
Detailed report
NanoMAP is still under development.
Incorrect results may sometimes occur, due to low quality reads, low quality reference genomes, and database redundancies.
It is a good idea to inspect the detailed report, or the runtime console output, to catch possible errors.
During runtime, NanoMAP creates a shortlist of candidate strains (strain group) for each true sample strain.
From this shortlist, the true sample strain is identified.
The detailed report captures a snapshot of the information NanoMAP used when identifying strains.
For each shortlist, the following information is recorded:
- Strain name
- Filename
- Strain group
- Naive abundance within group
- MAPQ=60 read count
- MAPQ=10 read count
- MAPQ=2 read count
Naive abundance is a very rough estimate of abundance within a strain group.
The MAPQ read counts record the number of reads which map uniquely to that strain's reference genome.
MAPQ scores are the basis for identifying strains, and will have informed NanoMAP's decisions.
In our experience, a human can often interpret the console information and projectname_detailed_report.tsv better than NanoMAP.
This project is covered under the MIT licence. You are free to use, copy, modify or distribute this software.