Bioinformatics Institute Course Project 2024

VarHunter

Maria Uzun, Nikita Zherko

Supervised by Natalia Dranenko


This tool is designed to detect bacterial phase variations.

Phase variation is the ability of pathogenic bacteria to alter their surface proteins to evade the host immune response.

This script handles cases where switching between surface protein variants is caused by inversion events affecting the two genes that encode these variants.

Phase variation process

The pipeline consists of the following steps:

  • Generation of gene pairs from different genomes and codon alignment of each pair using PRANK

  • Segmentation of the aligned sequences and calculation of the percentage of mismatches in each segment

  • Detection of phase variations based on mismatch values. For each row in the mismatch table (corresponding to one gene pair), the following conditions are checked:

    • Presence of only one maximum value.
    • Presence of a sharp jump in values in the vicinity of the maximum point.
    • 70% of values below the set limit (default 20%).
    • 9% of values above the set limit (default 40%).
    • Absence of long end gaps.

    An alignment passes the test if all of the above conditions are met. In that case, three more alignments connected to it are sought among the genes of the same two genomes. If a total of four alignments are found, two whose profile shows an extremum followed by a plateau and two whose profile shows a plateau followed by an extremum, the names of the two genomes and the four genes are saved in a dataframe called pre-variations. The profiles within each pre-variation are then checked for a match, and the final dataframe with variations is output.

  • The script also generates plots based on the mismatch values.
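The segmentation and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not VarHunter's actual code: the function names `segment_mismatches` and `passes_basic_checks` are hypothetical, the segment length is assumed to be the alignment length divided by the division factor, and the thresholds are read as "at least 70% of segments below 20%" and "at least 9% of segments above 40%". The sharp-jump and end-gap checks are omitted.

```python
def segment_mismatches(seq_a, seq_b, division_factor=20):
    """Return the percentage of mismatching columns in each segment
    of two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: segment length = alignment length // division_factor.
    seg_len = max(1, len(seq_a) // division_factor)
    percentages = []
    for start in range(0, len(seq_a), seg_len):
        a = seq_a[start:start + seg_len]
        b = seq_b[start:start + seg_len]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        percentages.append(100.0 * mismatches / len(a))
    return percentages


def passes_basic_checks(values, low_limit=20.0, high_limit=40.0,
                        low_fraction=0.70, high_fraction=0.09):
    """Check three of the listed conditions on one mismatch profile
    (assumed reading of the thresholds, see the text above)."""
    peak = max(values)
    single_max = values.count(peak) == 1  # only one maximum value
    below = sum(v < low_limit for v in values) / len(values)
    above = sum(v > high_limit for v in values) / len(values)
    return single_max and below >= low_fraction and above >= high_fraction
```

A profile that is flat except for a single high segment passes these checks, while a profile with two equally high segments is rejected by the single-maximum condition.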

Installation

To get the tool, clone the Git repository:

git clone git@github.com:uzunmasha/VarHunter.git && cd VarHunter

Create a conda/mamba environment with the necessary packages and activate it:

conda env create -f environment.yml
conda activate varhunter

Usage

To run the script, call it from the directory where the tool is located:

python varhunter.py -g <genomes_csv> -l <lst_info_file> -f <fasta_file> -s <selection_condition>

Usage options:

  -h, --help            show this help message and exit

Required arguments:
  -g, --genomes [GENOMES_CSV]
                        Path to the CSV file containing genome information
  -l, --lst_info_file [LST_INFO_FILE]
                        Path to the LSTINFO file containing additional information
  -f, --fasta_file [FASTA_FILE]
                        Path to the FASTA file containing gene sequences
  -s, --selection_condition [SELECTION_CONDITION]
                        Selection condition(s) for filtering gene pairs

Optional arguments:
  -d, --division_factor [DIVISION_FACTOR]
                        Number to divide alignment length for segment calculation (default: 20)
  -p, --processes [PROCESSES]
                        Number of processes to use for parallel execution (default: 1)
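For readers who want to drive the tool from their own code, the options listed above could be reproduced with `argparse` along the following lines. This is only a sketch of an equivalent command-line interface: the function name `build_parser` is hypothetical, and the integer types for `--division_factor` and `--processes` are assumptions based on the documented defaults, not taken from varhunter.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented VarHunter options."""
    parser = argparse.ArgumentParser(prog="varhunter.py")

    required = parser.add_argument_group("Required arguments")
    required.add_argument("-g", "--genomes", required=True,
                          help="Path to the CSV file containing genome information")
    required.add_argument("-l", "--lst_info_file", required=True,
                          help="Path to the LSTINFO file")
    required.add_argument("-f", "--fasta_file", required=True,
                          help="Path to the FASTA file containing gene sequences")
    required.add_argument("-s", "--selection_condition", required=True,
                          help="Selection condition(s) for filtering gene pairs")

    optional = parser.add_argument_group("Optional arguments")
    # Assumption: both optional values are integers.
    optional.add_argument("-d", "--division_factor", type=int, default=20,
                          help="Number to divide alignment length by")
    optional.add_argument("-p", "--processes", type=int, default=1,
                          help="Number of parallel processes")
    return parser
```

Parsing a command line that omits `-d` leaves `division_factor` at its default of 20.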

Input file formats:

Example input files can be found in the raw_data folder. They were obtained from genome annotation with PanACoTA.

  • Genomes_csv contains the names of bacterial genera, species, and strains, along with their RefSeq accessions.
  • Lst_info_file contains the sequential identifiers of genomes in the format 4 letters, 4 digits, 5 digits, and shows how these identifiers correspond to the entries in the Genomes_csv file.
  • Fasta_file contains the target genes (e.g., histidine triad genes) from the analyzed genomes.
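An identifier in the "4 letters, 4 digits, 5 digits" format can be validated with a regular expression. The sketch below assumes the three parts are separated by dots, as in PanACoTA-style names; the example value `STPN.0124.00001` and the helper name `parse_genome_id` are made up for illustration and do not come from the repository.

```python
import re

# Assumption: the parts are dot-separated, e.g. "STPN.0124.00001".
GENOME_ID = re.compile(r"([A-Za-z]{4})\.(\d{4})\.(\d{5})")


def parse_genome_id(identifier):
    """Return (letters, digits4, digits5) or None if the format differs."""
    match = GENOME_ID.fullmatch(identifier)
    if match is None:
        return None
    return match.groups()
```

A check like this can catch malformed identifiers before a long run starts.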

Expected output:

If phase variations are detected, an output directory is created with the following content:

  • a CSV file listing the genomes and their specific genes involved in phase variations
  • a directory with the alignments of the gene pairs involved in phase variations
  • a directory with mismatch plots for the gene pairs involved in phase variations

Examples

To test the code functionality and output format, run it on the test data:

python varhunter.py -g test_data/Genomes_names_Streptococci.csv -l test_data/Lst_info_Streptococci.lst -f test_data/Histidine_triads_Streptococci_nt.fasta -s "Streptococcus pneumoniae" -p 6

Troubleshooting

  • When the script is run again, the pair generation and alignment steps are skipped if the folders with pairs and their alignments already exist, in order to save computation time. To make the code run all stages again, delete these folders manually.
  • At the moment, the code works reliably on any Linux system; on other systems, issues with codon alignment may occur.
