A bioinformatic toolkit to align genome assemblies into pangenome graphs

neherlab/pangraph

PanGraph

a bioinformatic toolkit to align large sets of closely related genomes into a graph data structure

Warning

Pangraph is currently undergoing a major migration from v0 to v1. During this short transition period, links and documentation may be inconsistent.

Overview

pangraph provides a command line interface to find homology amongst large collections of closely related genomes. The core of the algorithm partitions each genome into blocks, each of which represents a sequence interval related by vertical descent. Each genome is then an ordered walk along blocks, and the collection of all genomes forms a graph that captures all observed structural diversity. pangraph is a standalone tool useful to parsimoniously infer horizontal gene transfer events within a community; to perform comparative studies of genome gain, loss, and rearrangement dynamics; or simply to compress many related genomes.
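The block-and-walk picture described above can be sketched with a toy example. The snippet below is only an illustration of the data structure, not pangraph's implementation or file format: the genome names, block identifiers, and helper functions are all invented, and "core" is taken here to mean a block that occurs exactly once in every genome, which is an assumption of this sketch.

```python
from collections import Counter

# Toy data, invented for illustration: each genome is an ordered walk
# over (block_id, strand) pairs.
paths = {
    "genome_A": [("b1", "+"), ("b2", "+"), ("b3", "+")],
    "genome_B": [("b1", "+"), ("b3", "+")],
    "genome_C": [("b1", "+"), ("b2", "-"), ("b3", "+"), ("b4", "+")],
}


def occurrences(paths):
    """Map block_id -> Counter of how often the block occurs in each genome."""
    occ = {}
    for genome, walk in paths.items():
        for block, _strand in walk:
            occ.setdefault(block, Counter())[genome] += 1
    return occ


def core_blocks(paths):
    """Blocks that occur exactly once in every genome (assumed definition)."""
    n_genomes = len(paths)
    return sorted(
        block
        for block, per_genome in occurrences(paths).items()
        if len(per_genome) == n_genomes
        and all(count == 1 for count in per_genome.values())
    )


print(core_blocks(paths))  # ['b1', 'b3']
```

In this toy graph, b2 and b4 are accessory blocks: b2 is missing from genome_B and b4 is found only in genome_C, which is the kind of structural variation the graph is meant to capture.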

The original version of pangraph (v0) was implemented in Julia and is described in the publication Noll, Molari, Shaw and Neher, 2023. The current version (v1) is a reimplementation of the original algorithm in Rust by Ivan Aksamentov and Marco Molari. The new implementation should be much easier to install and is faster in many use cases.

Installation

Pangraph is available:

  • as a standalone binary
  • as a docker container

For more detailed installation instructions, please refer to the documentation.

Standalone binary

This is the recommended way to install Pangraph. You can download the latest release for your operating system from here.

Docker container

PanGraph is available as a Docker container:

docker pull neherlab/pangraph:latest

See the documentation for extended instructions on its usage.

Examples

Please refer to the tutorials within the documentation for an in-depth usage guide. For a quick reference, see below.

Align the sequences in a multi-FASTA file sequences.fa into a graph:

pangraph build sequences.fa -o graph.json

Extract the core-genome alignment from the graph, with blocks appearing in the order of the reference genome NC_010468:

pangraph export core-genome graph.json \
  --guide-strain NC_010468 \
  -o core_genome_aln.fa
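As a quick sanity check on an exported alignment, one might verify that all records have the same length. The sketch below uses only the Python standard library and an invented in-memory FASTA string; `parse_fasta` and `alignment_length` are hypothetical helpers, not part of pangraph or PyPangraph. To check a real file, read core_genome_aln.fa into a string first.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def alignment_length(records):
    """Return the common sequence length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"expected one alignment length, found {sorted(lengths)}")
    return lengths.pop()


# Invented toy alignment; sequences may span several lines.
toy = ">NC_010468\nACGT\nAC\n>strain_2\nAC-TAC\n"
print(alignment_length(parse_fasta(toy)))  # 6
```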

Export the graph in GFA format for visualization:

pangraph export gfa graph.json -o graph.gfa
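GFA is a tab-separated text format in which the first field of each line gives the record type (for example `S` for segments, `L` for links, `P` for paths). A minimal sketch for summarizing such a file is shown below; it runs on an invented GFA string, and which record types pangraph actually writes is not stated here, so treat the toy content as an assumption. `gfa_record_counts` is a hypothetical helper.

```python
from collections import Counter


def gfa_record_counts(text):
    """Count GFA lines by record type (the first tab-separated field)."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Invented toy GFA: one header, two segments, one link, one path.
toy_gfa = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGT\n"
    "S\t2\tGG\n"
    "L\t1\t+\t2\t+\t0M\n"
    "P\tgenome_A\t1+,2+\t*\n"
)

counts = gfa_record_counts(toy_gfa)
print(counts["S"], counts["L"], counts["P"])  # 2 1 1
```

For a real export, the same function can be applied to the contents of graph.gfa read as text.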

Reconstruct input sequences from the graph:

pangraph reconstruct graph.json -o sequences.fa

PyPangraph

PyPangraph is a Python package with convenient utilities to load and explore the graph data structure. See the documentation for installation instructions and more examples.

import pypangraph as pp

graph = pp.Pangraph.load_graph("graph.json")
print(graph)
# pangraph object with 15 paths, 137 blocks and 1042 nodes

Citation

If you use PanGraph in scientific publications, please cite the original paper presenting the algorithm:

PanGraph: scalable bacterial pan-genome graph construction. Nicholas Noll, Marco Molari, Liam P. Shaw, Richard A. Neher. Microbial Genomics, 9(6), 2023; doi: 10.1099/mgen.0.001034

License

MIT License

Note

The legacy v0 version of Pangraph is now stored on the v0 branch of the repository, and legacy documentation is available here.