metaMap - quantifying genomes in metagenomes

Snakemake workflow to map and quantify prevalence and abundance of genomes in metagenomes.

Paired metagenomic reads are mapped to a genome database with BWA and filtered by nucleotide identity, mapping score, and alignment fraction with SAMtools. Read counts, coverage, and depth are then calculated per genome for each metagenome, and final summary files are generated across all samples. These filtering concepts are discussed in detail by Matt Olm (developer of dRep and inStrain).
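As an illustration only, the per-sample mapping step could be approximated on the command line as below; the pipeline's actual commands, thresholds, and the nucleotide-identity and alignment-fraction filters live in the Snakemake rules, and all file names here are placeholders.

```shell
# Illustrative sketch, not the pipeline's exact commands: map paired reads,
# keep primary mapped alignments with MAPQ >= 20, then summarize per reference.
bwa mem -t 4 database/ref_genomes.fa sample_1.fastq.gz sample_2.fastq.gz |
    samtools view -b -q 20 -F 0x904 - |
    samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
samtools idxstats sample.sorted.bam   # mapped read counts per reference
samtools coverage sample.sorted.bam   # breadth of coverage and mean depth
```

The `-F 0x904` filter drops unmapped (0x4), secondary (0x100), and supplementary (0x800) alignments.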

Installation

  1. Install conda and snakemake

  2. Clone repository

git clone https://github.com/alexmsalmeida/metamap.git

How to run

  1. Edit the config.yml file to point to your input, output, and database directories/files. The input file should contain two tab-separated columns with the paths to the forward and reverse reads of each metagenome to map (ending in _1.fastq.gz and _2.fastq.gz, respectively). The database entry should point to a multi-FASTA file containing the reference genomes, indexed with BWA (bwa index). If mapping to a reference database where each sequence represents its own genome (e.g., a viral sequence catalog), change the type option in config.yml from "contigs" to "complete".

Note: If using the "contigs" option with a combined set of reference genomes, make sure the contig names in the final FASTA file are structured as [genome_name]_1, [genome_name]_2, etc. You can use the rename_multifasta_prefix.py script provided in the tools/ folder to rename each genome FASTA before concatenating the files into one.
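If you prefer not to use the helper script, the same [genome_name]_N naming can be produced with plain awk; a minimal sketch (the helper function name and file names are ours):

```shell
# Tiny example genome FASTA (placeholder data).
printf '>contig1\nACGT\n>contig2\nTTAA\n' > genomeA.fa

# Rename every contig header to <prefix>_1, <prefix>_2, ... as required
# when combining genomes under the "contigs" option.
rename_contigs() {
    awk -v p="$2" '/^>/ { print ">" p "_" (++n); next } { print }' "$1"
}

rename_contigs genomeA.fa genomeA > ref_genomes.fa   # >genomeA_1, >genomeA_2
# ...repeat per genome (appending with >>), then: bwa index ref_genomes.fa
```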

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
snakemake --use-conda -k -j 4
  2. (option 2) Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A {cluster.project} -p {cluster.queue} --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output} --time={cluster.time}'
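For orientation, the config and input files might look like the following; the key names here are assumptions based on the description above, so check the config.yml shipped with the repository for the exact schema.

```yaml
# Illustrative only -- key names are assumed, paths are placeholders.
input: /path/to/samples.tsv        # tab-separated: forward<TAB>reverse read paths
output: /path/to/output            # results directory
database: /path/to/ref_genomes.fa  # BWA-indexed multi-FASTA of reference genomes
type: contigs                      # or "complete" for one-sequence-per-genome catalogs
```

Each line of the input file then lists one sample, e.g. /data/sampleA_1.fastq.gz followed by a tab and /data/sampleA_2.fastq.gz.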

Output

The main output is located in the summary/ directory, which contains four files:

  • bwa_counts-total.csv: read counts per genome across all samples including multi-mapped reads.
  • bwa_counts-unique.csv: read counts per genome across all samples excluding multi-mapped reads.
  • bwa_cov-est.csv: breadth of coverage per genome across all samples.
  • bwa_cov-exp.csv: expected breadth of coverage per genome across all samples based on their level of read depth. This calculation is further discussed here.
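From these summaries you can, for example, tally per-genome prevalence (the number of samples above a coverage threshold) with awk. The column layout used below (genome name first, then one breadth-of-coverage column per sample) is an assumption about the summary format, and the data are mocked:

```shell
# Mock bwa_cov-est.csv with the assumed layout (placeholder genome names).
printf 'genome,sample1,sample2\ngenomeA,0.92,0.10\ngenomeB,0.61,0.75\n' > cov-est.csv

# For each genome, count samples whose breadth of coverage exceeds 0.5.
awk -F, -v t=0.5 'NR > 1 {
    n = 0
    for (i = 2; i <= NF; i++) if ($i + 0 > t) n++
    print $1 "," n
}' cov-est.csv
# -> genomeA,1
#    genomeB,2
```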
