Pipeline to integrate multiple genome assemblies into a pangenome graph representation. This pipeline wraps the functionality Minigraph for genome graph construction. We add the utility to labels nodes with the assemblies it derives, which is crucial for many pangenome analyses. Analyses which are common in pangenome studies are also performed, including:
- Determination of core and flexible genome
- Analysis of non-reference sequences
- Extraction of the structural variations
- Novel genes predictions and corresponding expression level
- Variants (SNP and Indels) called from non-reference sequences
See Here for the scheme of the pipeline and Here for an example of the generated report. Detailed description of the command run is available Here.
Developed for analysis of bovine genomes, but should be applicable to the other species as well.
Please cite our paper below when using the pipeline/scripts in your research
Danang Crysnanto, Alexander S. Leonard, Zih-Hua Fang, Hubert Pausch. Novel functional sequences uncovered through a bovine multi-assembly graph. PNAS https://doi.org/10.1073/pnas.2101056118
- Follow installation instruction to install the pipeline.
Minigraph and Gfatools need to be installed and available in $PATH
. Please download Here to get the same version used in the paper.
Required python packages, R libraries, and bioinformatic softwares are listed Here.
Alternatively, one could use mamba / conda
to create an environment with all softwares installed (Minigraph not included).
To generate pdf
report one need to install weasyprint.
conda env create -f envs/environment.yml
conda activate pangenome
- Set all parameters required in
config/config.yaml
. All paths will be interpreted relative to theworkdir
directory. - Multiple assemblies in the
workdir/assembly
with the naming scheme of {population}.fa. The prefix will be used as the identifier for the assembly. - A file
config/graph_comp.tsv
that mapping the name of the graph and the order of inclusion. First in the order used as the graph backbone, which is usually the reference genome (e.g., UCD).
Construction of multiple graphs can be specified in the different line e.g.,
graph1 UCD,OBV,Angus
graph2 UCD,Angus
- Cluster job specification in the pipeline designed for
LSF /bsub
system. We providesnakemake profile
to run on LSF system, modified from Here. One needs to adapt for the other computing clusters. Please follow guidelines Here. Pipeline can also be run locally without cluster executions.
Small dataset from three bovine assemblies (10 Mb each) located in test/assembly
folder are available for testing. Once all requirements are fullfiled, test can be done with command as below. This will be run test of pangenome construction and SV discovery with the local execution in a single core mode (< 5 mins). When test
success it will generate a pdf report in test/report
folder as in Here
snakemake -rp -j 1 -s snake_graph.py
Command below can be invoked for running on real dataset:
# local execution
snakemake -s snake_graph.py
# LSF cluster execution
snakemake --profile "snakemake_profile/lsf" -s snake_graph.py
-
Integrated pangenome graphs in
graph
folder with the prefix set in the config e.g.,graph1.gfa
-
Matrix that map node to the assembly it derives (i.e., node colours/labels), e.g.,
Node UCD OBV Angus s1 1 0 0 s2 0 1 1 *0 and 1 indicate absence and presence respectively
-
Structural variations derived from graphs.
These are large variations (fragment length > 100 bp) from bubbles in the graph that are not part of the reference sequences. The SVs are grouped by biallelic and multiallelic. SVs crossing coding sequences visualized usingGraphviz
. Script app.py can be run, which will set up a local webserver (not part of the pipeline) to inspect SVs in a more detailed and in interactive way (Under development). -
Prediction of novel /non-reference genes with corresponding expression levels from the transcriptome.
-
Variants (SNP and Indels) nested in non-reference sequences. One need to run separate pipeline for variant calling. Set config for the pipeline in Here.
-
Reports in
reports/{graph}
folder. This pdf contains summary of computational resources, statistics of core/flexible genome, structural variations, novel gene models derived from graphs. Will output a single pdf from each constructed graph. See the example Here.