Thesis title: Evaluating current bioinformatics tools for microbial epigenetics analysis in long-read sequencing data
Supervisors: Dr Tomasz Kurowski and Dr Alexey Larionov
DNA methylation is one of the main microbial epigenetic mechanisms. Microbial DNA methylation differs from that in eukaryotes in that it targets specific sequence motifs and primarily involves N6-adenine and N4-cytosine methylation, although C5-cytosine methylation can also be present. The main function of DNA methylation in bacteria is protection against foreign DNA through restriction-modification systems. It can also contribute to the regulation of replication, DNA mismatch repair and epigenetic regulation of gene expression. With the introduction of third-generation long-read sequencing on the PacBio and ONT platforms, it has become possible to read the bases of a single molecule directly and also to detect base modifications directly. These technologies are constantly refined alongside bioinformatic analysis tools, providing improved precision and accuracy. It is therefore important to keep up with these advancements and to identify best practices for analysing microbial epigenetic marks in order to gain useful, biologically relevant insights into how these microorganisms function. In this project, different analysis approaches were explored using the latest versions of PacBio SMRT Tools and the MicrobeMod toolkit, and a new custom pipeline was also proposed for PacBio data analysis. All pipelines successfully identified the main motifs, those with the highest methylation levels and abundance in the genomes, whereas the other identified motifs varied more between analysis protocols. Methylated motifs included common targets of solitary MTases in proteobacteria: GATC (in M. xanthus) and GANTC (in R. leguminosarum), with over 90% and 40% methylation respectively. Homologues of the genes coding for the MTases that target these motifs were also identified, further supporting the results.
The utility and limitations of each analysis approach were thoroughly explored, providing a comprehensive overview of current analysis options and paving the way for future, more complex metaepigenomic analyses.
| Species | Strain | System | Platform | Format | Ref. genome | Ref. |
|---|---|---|---|---|---|---|
| Streptococcus agalactiae | clinical isolate NEM316 | PacBio | Sequel II | BAM (CCS) | https://www.ncbi.nlm.nih.gov/nuccore/NC_004368.1 | Manzer & Doran, 2024 |
| Streptococcus agalactiae | clinical isolate CJB111 | PacBio | Sequel II | BAM (CCS) | https://www.ncbi.nlm.nih.gov/nuccore/CP063198 | Manzer & Doran, 2024 |
| Myxococcus xanthus | DZ2 | PacBio | Sequel II | BAM (subreads) | https://www.ncbi.nlm.nih.gov/nuccore/CP070500 | Jain et al., 2021 |
| Rhizobium leguminosarum | ATCC 10004 | ONT | MinION R10.4.1 (5 kHz) | POD5 | Provided with the data | Crits-Christoph et al., 2023 |

Crits-Christoph, A., Kang, S. C., Lee, H. H., & Ostrov, N. (2023). MicrobeMod: A computational toolkit for identifying prokaryotic methylation and restriction-modification with nanopore sequencing. BioRxiv, 2023.11.13.566931. https://doi.org/10.1101/2023.11.13.566931
Jain, R., Habermann, B. H., Mignot, T., & Stewart, F. J. (2021). Complete Genome Assembly of Myxococcus xanthus Strain DZ2 Using Long High-Fidelity (HiFi) Reads Generated with PacBio Technology. Microbiology Resource Announcements, 10(28). https://doi.org/10.1128/MRA.00530-21
Manzer, H. S., & Doran, K. S. (2024). Complete m6A and m4C methylomes for group B streptococcal clinical isolates CJB111, A909, COH1, and NEM316. Microbiology Resource Announcements, 13(1). https://doi.org/10.1128/MRA.00733-23
This repository contains scripts used for analysing DNA methylation in bacteria. The following pipelines were used for the analysis:
The Microbial genome analysis workflow has several stages: assembly of large contigs, assembly of plasmids, alignment of the input data to the assembled contigs, polishing, and base modification detection. The assembly steps are performed by IPA, the tool for HiFi genome assembly, and polishing is performed by Racon. Modification detection covers both the detection of methylation and the identification of methylated motifs. Input files have to be provided in XML format; the ‘dataset create’ SMRT tool was used to create the XML file from the raw BAM files. PacBio uses Cromwell as its official workflow manager and provides its own wrapper for it, ‘pbcromwell’, which runs the analysis with the ‘pbcromwell run pb_microbial_analysis’ command.
Microbial genome analysis workflow
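The two commands named above can be assembled programmatically before being handed to a job scheduler or `subprocess`. The following is only a minimal sketch: the file names are hypothetical, and the `--type`, `-e` and `--nproc` options, as well as the `ConsensusReadSet` dataset type, are assumptions that should be checked against `dataset create --help` and `pbcromwell run --help` for the installed SMRT Tools version. The function only builds the command lines and does not run them.

```python
import shlex


def build_commands(bam_path, xml_path, dataset_type="ConsensusReadSet", nproc=8):
    """Return the XML-creation and workflow commands as argument lists.

    Nothing is executed here; the caller decides how to run the commands.
    The option names are assumptions to verify against the tools' --help.
    """
    # Step 1: wrap the raw BAM file in a dataset XML file.
    create_xml = ["dataset", "create", "--type", dataset_type, xml_path, bam_path]
    # Step 2: run the microbial analysis workflow on that XML file.
    run_workflow = [
        "pbcromwell", "run", "pb_microbial_analysis",
        "-e", xml_path,
        "--nproc", str(nproc),
    ]
    return [create_xml, run_workflow]


if __name__ == "__main__":
    # Hypothetical input and output names, for illustration only.
    for command in build_commands("reads.hifi.bam", "reads.consensusreadset.xml"):
        print(shlex.join(command))
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems when paths contain spaces.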
For the analysis of PacBio data, an additional custom pipeline was designed that focuses on DNA methylation analysis and omits the genome assembly steps, since appropriate reference genome files are available for all of the analysed data. The first steps in the pipeline are conditional data preprocessing with the ‘ccs-kinetics-bystrandify’ tool and XML file creation with ‘dataset create’. The reference and dataset XML files were used as input to the ‘pb_align_ccs’ workflow, which aligns the data to the reference and outputs an alignment BAM file and a coverage GFF file. The alignment file was then used as input for the ‘ipdSummary’ tool, which identifies modifications (m6A and m4C). Its output, a GFF file, was passed to the ‘motifMaker’ tool, which performs motif identification, creates a CSV file of motifs and updates the modification GFF file with motif information. The coverage GFF file from the alignment step was used by the ‘summarizeModifications’ tool to create a modification summary GFF file, and a custom script was written to plot the data.
Custom designed DNA methylation analysis pipeline
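As an illustration of how the modification GFF file produced in this pipeline could be inspected before plotting, the sketch below tallies m6A and m4C calls. It is not the plotting script from this repository. It assumes standard nine-column, tab-separated GFF records with the modification type in the third column and `key=value` attributes in the ninth; the example records, contig name and attribute values are made up.

```python
from collections import Counter


def parse_modification_gff(lines):
    """Parse nine-column GFF records into a list of dictionaries."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF header/comment lines
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard nine-column GFF record
        attributes = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        records.append({
            "contig": cols[0],
            "type": cols[2],
            "position": int(cols[3]),  # 1-based, as in GFF
            "strand": cols[6],
            "attributes": attributes,
        })
    return records


def count_by_type(records, wanted=("m6A", "m4C")):
    """Count records for each modification type of interest."""
    counts = Counter(record["type"] for record in records)
    return {mod_type: counts.get(mod_type, 0) for mod_type in wanted}


# Made-up records for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3",
    "\t".join(["contig_1", "kinModCall", "m6A", "1024", "1024", "45", "+", ".",
               "coverage=120;IPDRatio=5.10"]),
    "\t".join(["contig_1", "kinModCall", "m4C", "2048", "2048", "38", "-", ".",
               "coverage=98;IPDRatio=3.20"]),
    "\t".join(["contig_1", "kinModCall", "m6A", "4096", "4096", "51", "-", ".",
               "coverage=110;IPDRatio=6.00"]),
    "\t".join(["contig_1", "kinModCall", "modified_base", "5000", "5000", "22",
               "+", ".", "coverage=105;IPDRatio=2.10"]),
]

if __name__ == "__main__":
    print(count_by_type(parse_modification_gff(EXAMPLE_GFF)))
    # prints {'m6A': 2, 'm4C': 1}
```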
ONT - MicrobeMod https://github.com/cultivarium/MicrobeMod
ONT data were first preprocessed by basecalling with Dorado, followed by mapping the reads to the reference with minimap2 (Figure 2 3). Basecalling used either the latest Dorado model (v5.0.0) or the Rerio research model, which is optimised for highly methylated bacterial DNA. In addition, compatible official Dorado modification models were used for m6A, m4C and 5mC methylation calling.
The MicrobeMod call_methylation workflow first identifies methylated sites and extracts methylation frequencies with the Modkit tool. Next, 24-base sequences surrounding highly methylated positions are extracted and analysed with STREME, which identifies significantly enriched motifs. Finally, the fraction of methylated occurrences of each motif is recorded.
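The last two steps can be illustrated with a small, self-contained sketch. This is not MicrobeMod's implementation: the way the 24-base window is placed around the methylated position, the forward-strand-only search and the toy genome are all assumptions made for illustration. Degenerate motifs such as GANTC are expanded into regular expressions using IUPAC codes.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def extract_window(genome, position, width=24):
    """Return `width` bases around a 0-based position, or None near contig ends.

    The window starts width // 2 bases upstream of the position (an assumption).
    """
    start = position - width // 2
    end = start + width
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def motif_methylated_fraction(genome, motif, mod_offset, methylated_positions):
    """Return (methylated occurrences, total occurrences, fraction) for a motif.

    `mod_offset` is the 0-based index of the modified base within the motif.
    Only the forward strand is searched; overlapping occurrences are counted.
    """
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in motif) + "))")
    starts = [match.start() for match in pattern.finditer(genome)]
    if not starts:
        return 0, 0, 0.0
    methylated = sum(1 for s in starts if s + mod_offset in methylated_positions)
    return methylated, len(starts), methylated / len(starts)


if __name__ == "__main__":
    # Toy genome with four GANTC sites; adenines at positions 3 and 17 are
    # treated as methylated (made-up values).
    toy_genome = "AAGACTCAAGAATCAAGAGTCAAGATTCAA"
    print(extract_window(toy_genome, 15))
    print(motif_methylated_fraction(toy_genome, "GANTC", 1, {3, 17}))
    # second line prints (2, 4, 0.5)
```

A full analysis would also search the reverse strand and apply a methylation-frequency threshold to decide which positions count as methylated.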
MicrobeMod annotate_rm identifies genes potentially involved in DNA methylation and restriction, using the Prodigal, HMMER and cath-resolve-hits tools. Next, BLASTP is used to find gene homologues in the REBASE database so that, when available, additional information on the target motifs of the MTases encoded by the identified genes can be included.
MicrobeMod toolkit workflow