metaMap - quantifying genomes in metagenomes

Snakemake workflow to map and quantify prevalence and abundance of genomes in metagenomes.

Paired metagenomic reads are mapped to a genome database with BWA and filtered by nucleotide identity, mapping score, and alignment fraction with SAMtools. Read counts, coverage, and depth are then calculated per genome for each metagenome, and final summary files are generated across all samples. These filtering concepts are discussed in detail by Matt Olm (developer of dRep and inStrain).
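As an illustration only, the per-sample mapping step could be approximated on the command line as below; the pipeline's actual commands, thresholds, and the nucleotide-identity and alignment-fraction filters live in the Snakemake rules, and all file names here are placeholders.

```shell
# Illustrative sketch, not the pipeline's exact commands: map paired reads,
# keep primary mapped alignments with MAPQ >= 20, then summarize per reference.
bwa mem -t 4 database/ref_genomes.fa sample_1.fastq.gz sample_2.fastq.gz |
    samtools view -b -q 20 -F 0x904 - |
    samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
samtools idxstats sample.sorted.bam   # mapped read counts per reference
samtools coverage sample.sorted.bam   # breadth of coverage and mean depth
```

The `-F 0x904` filter drops unmapped (0x4), secondary (0x100), and supplementary (0x800) alignments.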

Installation

  1. Install conda and snakemake

  2. Clone repository

git clone https://github.com/alexmsalmeida/metamap.git

How to run

  1. Edit the config.yml file to point to your input, output, and database directories/files. The input file should contain two tab-separated columns with the paths to the forward and reverse reads of each metagenome to map (ending in _1.fastq.gz and _2.fastq.gz, respectively). The database entry should point to a multi-FASTA file containing the reference genomes, indexed with BWA (bwa index). If mapping to a reference database where each sequence represents its own genome (e.g., a viral sequence catalog), change the type option in config.yml from "contigs" to "complete".

Note: If using the "contigs" option with a combined set of reference genomes, make sure the contig names in the final FASTA file are structured as [genome_name]_1, [genome_name]_2, etc. You can use the rename_multifasta_prefix.py script provided in the tools/ folder to rename each genome FASTA before concatenating the files into one.
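If you prefer not to use the helper script, the same [genome_name]_N naming can be produced with plain awk; a minimal sketch (the helper function name and file names are ours):

```shell
# Tiny example genome FASTA (placeholder data).
printf '>contig1\nACGT\n>contig2\nTTAA\n' > genomeA.fa

# Rename every contig header to <prefix>_1, <prefix>_2, ... as required
# when combining genomes under the "contigs" option.
rename_contigs() {
    awk -v p="$2" '/^>/ { print ">" p "_" (++n); next } { print }' "$1"
}

rename_contigs genomeA.fa genomeA > ref_genomes.fa   # >genomeA_1, >genomeA_2
# ...repeat per genome (appending with >>), then: bwa index ref_genomes.fa
```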

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
snakemake --use-conda -k -j 4
  2. (option 2) Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A {cluster.project} -p {cluster.queue} --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output} --time={cluster.time}'
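For orientation, the config and input files might look like the following; the key names here are assumptions based on the description above, so check the config.yml shipped with the repository for the exact schema.

```yaml
# Illustrative only -- key names are assumed, paths are placeholders.
input: /path/to/samples.tsv        # tab-separated: forward<TAB>reverse read paths
output: /path/to/output            # results directory
database: /path/to/ref_genomes.fa  # BWA-indexed multi-FASTA of reference genomes
type: contigs                      # or "complete" for one-sequence-per-genome catalogs
```

Each line of the input file then lists one sample, e.g. /data/sampleA_1.fastq.gz followed by a tab and /data/sampleA_2.fastq.gz.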

Output

The main output is located in the summary/ directory, which contains four files:

  • bwa_counts-total.csv: read counts per genome across all samples including multi-mapped reads.
  • bwa_counts-unique.csv: read counts per genome across all samples excluding multi-mapped reads.
  • bwa_cov-est.csv: breadth of coverage per genome across all samples.
  • bwa_cov-exp.csv: expected breadth of coverage per genome across all samples based on their level of read depth. This calculation is further discussed here.
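From these summaries you can, for example, tally per-genome prevalence (the number of samples above a coverage threshold) with awk. The column layout used below (genome name first, then one breadth-of-coverage column per sample) is an assumption about the summary format, and the data are mocked:

```shell
# Mock bwa_cov-est.csv with the assumed layout (placeholder genome names).
printf 'genome,sample1,sample2\ngenomeA,0.92,0.10\ngenomeB,0.61,0.75\n' > cov-est.csv

# For each genome, count samples whose breadth of coverage exceeds 0.5.
awk -F, -v t=0.5 'NR > 1 {
    n = 0
    for (i = 2; i <= NF; i++) if ($i + 0 > t) n++
    print $1 "," n
}' cov-est.csv
# -> genomeA,1
#    genomeB,2
```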
