Bioinformatics Institute Course Project 2024
This tool is designed to detect bacterial phase variations.
Phase variation is the ability of pathogenic bacteria to alter their surface proteins to evade the host immune response.
This script handles cases where switching between surface protein variants is caused by inversion events affecting the two genes that encode these variants.
The pipeline consists of the following steps:
- Generation of gene pairs from different genomes and codon alignment of each pair using PRANK
- Segmentation of the aligned sequences and calculation of the percentage of mismatches in each segment
- Detection of phase variations based on the mismatch values. For each row in the mismatch table (corresponding to one gene pair), the following conditions are checked:
  - Presence of only one maximum value.
  - Presence of a sharp jump in values in the vicinity of the maximum point.
  - 70% of values below the set limit (default 20%).
  - 9% of values above the set limit (default 40%).
  - Absence of long end gaps.
An alignment passes the test if all of the above conditions are met. In that case, three more alignments connected to it are sought among the genes of the two genomes already identified. If four alignments are found in total, two whose profile shows an extremum followed by a plateau and two whose profile shows a plateau followed by an extremum, the names of the two genomes and the four genes are saved in a dataframe called pre-variations. The matching profiles in each pre-variation are then checked, and the final dataframe with variations is output.
- The script also generates plots based on the mismatch values.
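The segmentation step and some of the profile checks can be illustrated with a minimal sketch. It is not the code in varhunter.py: the function names are hypothetical, the segment length is assumed to be the alignment length divided by the division factor, a gap aligned to a base is assumed to count as a mismatch, and the two threshold conditions are read as "at least 70%" and "at least 9%". The sharp-jump and end-gap checks are left out.

```python
def segment_mismatch_percentages(seq_a, seq_b, division_factor=20):
    """Return the percentage of mismatched columns in each alignment segment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        return []
    # Assumption: segment length is the alignment length divided by the factor.
    segment_length = max(1, len(seq_a) // division_factor)
    percentages = []
    for start in range(0, len(seq_a), segment_length):
        columns = list(zip(seq_a[start:start + segment_length],
                           seq_b[start:start + segment_length]))
        # Assumption: a gap aligned to a base counts as a mismatch.
        mismatches = sum(1 for a, b in columns if a != b)
        percentages.append(100.0 * mismatches / len(columns))
    return percentages


def passes_threshold_checks(values, low_limit=20.0, high_limit=40.0,
                            low_fraction=0.70, high_fraction=0.09):
    """Check only the single-maximum condition and the two threshold conditions."""
    if not values:
        return False
    n = len(values)
    single_maximum = values.count(max(values)) == 1
    # Assumption: the percentages in the list above are minimum fractions.
    enough_low = sum(v < low_limit for v in values) / n >= low_fraction
    enough_high = sum(v > high_limit for v in values) / n >= high_fraction
    return single_maximum and enough_low and enough_high


# Toy alignment: identical except for the last five columns.
aligned_a = "ATGAAACCCGGGTTTAAACC"
aligned_b = "ATGAAACCCGGGTTTTTTGG"
profile = segment_mismatch_percentages(aligned_a, aligned_b, division_factor=4)
print(profile)
print(passes_threshold_checks(profile))
```

In this toy case the profile is flat with a single peak in the last segment, which is the kind of "plateau followed by an extremum" shape the detection step looks for.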
To get the tool, clone the git repository:
git clone git@github.com:uzunmasha/VarHunter.git && cd VarHunter
Create a conda/mamba environment with the necessary packages and activate it:
conda env create -f environment.yml
conda activate varhunter
To run the script, call it from the directory where the tool is located:
python varhunter.py -g <genomes_csv> -l <lst_info_file> -f <fasta_file> -s <selection_condition>
-h, --help show this help message and exit
Required arguments:
-g, --genomes [GENOMES_CSV]
Path to the CSV file containing genome information
-l, --lst_info_file [LST_INFO_FILE]
Path to the LSTINFO file containing additional information
-f, --fasta_file [FASTA_FILE]
Path to the FASTA file containing gene sequences
-s, --selection_condition [SELECTION_CONDITION]
Selection condition(s) for filtering gene pairs
Optional arguments:
-d, --division_factor [DIVISION_FACTOR]
Number to divide alignment length for segment calculation (default: 20)
-p, --processes [PROCESSES]
Number of processes to use for parallel execution (default: 1)
Examples of the input file formats can be found in the raw_data folder. In essence, these files are the result of genome annotation with PanACoTA.
- Genomes_csv: contains names of bacterial genera, species, and strains, along with their RefSeq accessions.
- Lst_info_file: contains the sequential numbers of genomes in the format 4 letters, 4 digits, 5 digits, and shows how these sequential numbers correspond to the data in the Genomes_csv file.
- Fasta_file: contains target genes (e.g., histidine triad genes) from the analyzed genomes.
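The "4 letters, 4 digits, 5 digits" pattern of the genome identifiers can be checked with a regular expression. The sketch below assumes that the three parts are separated by dots, as in PanACoTA-style names; the helper name and the example identifiers are hypothetical and are not taken from the test data.

```python
import re

# Assumption: the three parts are dot-separated, e.g. "STPN.0424.00001".
GENOME_ID = re.compile(r"[A-Za-z]{4}\.\d{4}\.\d{5}")


def is_genome_id(name):
    """Return True if name has the form 4 letters, 4 digits, 5 digits."""
    return GENOME_ID.fullmatch(name) is not None


print(is_genome_id("STPN.0424.00001"))
print(is_genome_id("STPN.424.00001"))
```

A check like this can help catch malformed identifiers before the pipeline starts pairing genes.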
If phase variations are detected, an output directory is created with the following content:
- a CSV file listing the genomes and their specific genes involved in phase variations
- a directory with the alignments corresponding to the gene pairs involved in phase variations
- a directory with mismatch plots for the gene pairs involved in phase variations
To test the code functionality and output format, run it on the test data:
python varhunter.py -g test_data/Genomes_names_Streptococci.csv -l test_data/Lst_info_Streptococci.lst -f test_data/Histidine_triads_Streptococci_nt.fasta -s "Streptococcus pneumoniae" -p 6
- When the script is run again, the pair generation and alignment steps are skipped if the folders with pairs and their alignments already exist, which saves computation time. To make the code run all stages, delete these folders manually.
- At the moment, the code works reliably on Linux systems; on other systems, issues with codon alignment may occur.