Parallel metagenome assembler for large soil datasets

Authors

Chirag Jain
Patrick Flick
Tony Pan

Introduction

A parallel metagenomic assembler designed to handle very large datasets. The program identifies the disconnected subgraphs in the de Bruijn graph, partitions the input dataset accordingly, and runs the popular assembler Velvet independently on each partition. This software is a high-performance version of the khmer library for assembly.

The performance gains depend entirely on the de Bruijn graph containing disconnected components. We found that this assumption generally holds for the currently available soil datasets from forest and agricultural land.
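The partitioning idea can be illustrated with a small single-process sketch. It is not the tool's MPI implementation: it assumes one read sequence per line on stdin, ignores reverse complements, and uses a hypothetical function name, partition_reads. Treating each k-mer as a graph node, two reads belong to the same partition whenever a chain of shared k-mers links them, which a union-find over reads captures.

```shell
# partition_reads K: reads one sequence per line on stdin and prints
# "<partition_id><TAB><sequence>" for every read (illustrative helper only).
partition_reads() {
  awk -v k="$1" '
    # Union-find over read indices, with path halving in find().
    function find(x) {
      while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x] }
      return x
    }
    function unite(a, b) {
      a = find(a); b = find(b)
      if (a != b) parent[b] = a
    }
    {
      seq[NR] = $0
      parent[NR] = NR
      # Each k-mer is a graph node; the first read containing it "owns" it.
      # Any later read with the same k-mer is merged into the owner partition.
      for (i = 1; i + k - 1 <= length($0); i++) {
        kmer = substr($0, i, k)
        if (kmer in owner) unite(owner[kmer], NR)
        else owner[kmer] = NR
      }
    }
    END {
      for (r = 1; r <= NR; r++) printf "%d\t%s\n", find(r), seq[r]
    }
  '
}
```

Each distinct ID in the first column marks one partition that could be assembled independently of the others.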

Install

The repository and external submodules can be cloned directly:

git clone --recursive <GITHUB_URL>
mkdir build_directory
cd build_directory
cmake ../Metag_partitioning
make

Run

Inside the build directory, run:

mpirun -np <COUNT OF PROCESSES> ./bin/metaG --file <FASTQ_FILE> --velvetK <KMER_SIZE_FOR_ASSEMBLY>

For example:

mpirun -np 8 ./bin/metaG --file sample.fastq --velvetK 45

Sample files for trial runs are provided in the data folder of the repository. After a successful run, you should see a file called contigs.fa containing all the assembled contigs.
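A quick sanity check on the output is to count the contigs and their total length. The helper below is a minimal sketch that assumes contigs.fa is a standard FASTA file (header lines start with ">"); the name fasta_stats is hypothetical and not part of this repository.

```shell
# fasta_stats FILE: prints "<number of sequences> <total bases>" for a FASTA file.
fasta_stats() {
  awk '
    /^>/ { n++; next }            # a header line starts a new record
    { bases += length($0) }       # sequence text may wrap over several lines
    END { printf "%d %d\n", n, bases }
  ' "$1"
}
```

Calling fasta_stats contigs.fa after a run prints the two numbers on one line; a count of zero suggests the assembly step produced no contigs.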

Customization (required)

During assembly, Velvet performs file I/O to save intermediate results, so you need to specify suitable paths for it. Please check the include/config files. These two files contain all the parameters that users can tune.

Dependency

Requires a compiler with C++11 support (GCC 4.7 or above) and MPI.
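Before building, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch; the function name missing_tools is hypothetical, and the tool names in the last line (g++, mpirun, and so on) are assumptions about a typical setup that may differ on your system.

```shell
# missing_tools TOOL...: prints each named tool that is not on PATH
# and returns non-zero if at least one is missing.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Assumed tool names; adjust to match your compiler and MPI installation.
missing_tools git cmake make g++ mpirun || echo "install the tools listed above before building"
```

This checks only for presence, not versions; g++ --version shows whether the compiler meets the minimum stated above.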

External git submodules are downloaded and compiled automatically.

Cite

Please cite the following publication if you are using this code for your research:

A Parallel Connectivity Algorithm for de-Bruijn Graphs in Metagenomic Applications. Patrick Flick, Chirag Jain, Tony Pan, and Srinivas Aluru. Proceedings of 2015 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2015.
