This repository contains a Snakemake pipeline for the analysis of the structural genomic evolution of E. coli ST131 presented in our paper.
The dataset consists of complete E. coli ST131 genomes available on RefSeq. Accession numbers and metadata for the considered strains can be found in the datasets folder.
In short, the pipeline uses pangraph to build a pangenome graph representation of the chromosomes of all the considered strains. It then extracts all regions of structural variation, assigns MGEs and defense systems to each of these regions, and detects events that can be parsimoniously interpreted as a simple gain or loss of sequence. See this note for an overview of the pipeline.
The pipeline produces as output a results folder, containing processed data such as the pangenome graph and the junction graphs, and a figs folder, containing, among others, the main figures of the paper.
- Execution requires a valid installation of conda, mamba and snakemake (v7.32.4).
- For pangenome graph creation, the pangraph command must be available on your PATH; see the pangraph documentation for installation instructions.
- Optionally, to facilitate the download of GenBank records from NCBI, you can save your personal API key in config/ncbi_api_key.txt. It will be used automatically when downloading the data.
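The requirements above can be checked before launching anything. The following is a minimal sketch, not part of the repository: the helper name check_tools is hypothetical, and it only tests whether each command is found on the PATH, not whether the installed version matches (for example snakemake v7.32.4).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the pipeline): report which of the
# given commands are available on PATH. Returns 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the requirements list above.
check_tools conda mamba snakemake pangraph \
    || echo "Install the missing tools before running the pipeline."

# The NCBI API key is optional, so its absence is only reported.
if [ -s config/ncbi_api_key.txt ]; then
    echo "NCBI API key found in config/ncbi_api_key.txt"
else
    echo "No NCBI API key found (optional)"
fi
```

Run it from the root of the repository so that the relative path to the config folder resolves correctly.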
To execute the pipeline locally, it is sufficient to run:
snakemake --use-conda --cores 1 all
You can replace 1 with the desired number of cores.
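Before a full run it can be useful to preview the jobs Snakemake would schedule. The sketch below assumes nothing beyond the command shown above plus Snakemake's standard --dry-run flag; the helper pipeline_cmd is hypothetical and only prints the command line, so that it can be inspected before being executed.

```shell
#!/bin/sh
# Hypothetical helper: print the snakemake invocation for a given number of
# cores. Pass "dry" as second argument to add --dry-run, which lists the
# planned jobs without executing them.
pipeline_cmd() {
    cores="${1:-1}"
    mode="${2:-run}"
    if [ "$mode" = "dry" ]; then
        echo "snakemake --use-conda --cores $cores --dry-run all"
    else
        echo "snakemake --use-conda --cores $cores all"
    fi
}

# Show the dry-run command for 4 cores.
pipeline_cmd 4 dry
```

To actually launch the printed command from the repository root, wrap the call in command substitution, for example $(pipeline_cmd 4 dry).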
Given the high number of jobs and their memory and time requirements, we advise running the pipeline on a cluster. Execution using the SLURM workload manager is already set up, and the pipeline can be executed with:
snakemake --profile cluster all
Evolutionary dynamics of genome structure and content among closely related bacteria
Marco Molari, Liam P. Shaw and Richard A. Neher, bioRxiv (2024)
doi: https://doi.org/10.1101/2024.07.08.602537