VarHunter

Bioinformatics Institute Course Project 2024

Maria Uzun, Nikita Zherko

Supervised by Natalia Dranenko

This tool is designed to detect bacterial phase variations.

Phase variation is the ability of pathogenic bacteria to alter their surface proteins to evade the host immune response.

The script handles cases where switching between surface protein variants is caused by inversion events affecting the two genes that encode these variants.

The pipeline consists of the following steps:

Generation of gene pairs from different genomes and codon alignment of each pair using PRANK
Segmentation of the aligned sequences and calculation of the percentage of mismatches in each segment
Detection of phase variations based on the mismatch values. For each row in the mismatches table (corresponding to one gene pair), the following conditions are checked:
- Presence of only one maximum value.
- Presence of a sharp jump in values in the vicinity of the maximum point.
- 70% of values below the set limit (default 20%).
- 9% of values above the set limit (default 40%).
- Absence of long end gaps.
An alignment passes the test if all of the above conditions are met. The tool then searches for three more alignments, linked to this one, between genes from the same pair of genomes. If four alignments are found in total, two with a profile showing an extremum followed by a plateau and two with a plateau followed by an extremum, the names of the two genomes and the four genes are saved in a dataframe called pre-variations. The profiles within each pre-variation are then checked for a match, and the final dataframe with variations is output.
The script also generates plots based on the mismatch values.
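
The segmentation step and two of the listed checks can be sketched as follows. This is a minimal illustration, not the code in varhunter.py: the function names are hypothetical, the segment length is assumed to be the alignment length divided by the division factor, gaps are assumed to count as mismatches, and the 70% and 9% conditions are assumed to mean "at least". The sharp-jump and end-gap checks are not shown because their exact definitions are not given here.

```python
from typing import List


def segment_mismatch_percentages(seq_a: str, seq_b: str,
                                 division_factor: int = 20) -> List[float]:
    """Return the mismatch percentage of each segment of an aligned pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        return []
    # Assumption: segment length = alignment length // division_factor
    seg_len = max(1, len(seq_a) // division_factor)
    percentages = []
    for start in range(0, len(seq_a), seg_len):
        a = seq_a[start:start + seg_len]
        b = seq_b[start:start + seg_len]
        # Assumption: any differing column, including gap vs base, is a mismatch
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        percentages.append(100.0 * mismatches / len(a))
    return percentages


def has_single_maximum(values: List[float]) -> bool:
    """True if the maximum value occurs exactly once in the profile."""
    return bool(values) and values.count(max(values)) == 1


def passes_fraction_thresholds(values: List[float],
                               low_limit: float = 20.0,
                               high_limit: float = 40.0,
                               low_fraction: float = 0.70,
                               high_fraction: float = 0.09) -> bool:
    """Check the share of values below low_limit and above high_limit."""
    if not values:
        return False
    below = sum(v < low_limit for v in values) / len(values)
    above = sum(v > high_limit for v in values) / len(values)
    # Assumption: both fractions are minimum requirements ("at least")
    return below >= low_fraction and above >= high_fraction
```

A profile such as eight segments near 0% followed by one segment at 50% would satisfy both checks shown here under these assumptions.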

Installation

To get the tool, clone the Git repository:

git clone git@github.com:uzunmasha/VarHunter.git && cd VarHunter

Create a conda/mamba environment with the necessary packages and activate it:

conda env create -f environment.yml
conda activate varhunter

Usage

To run the script, call it from the directory where the tool is located:

python varhunter.py -g <genomes_csv> -l <lst_info_file> -f <fasta_file> -s <selection_condition>

Usage options:

  -h, --help            show this help message and exit

Required arguments:
  -g, --genomes [GENOMES_CSV]
                        Path to the CSV file containing genome information
  -l, --lst_info_file [LST_INFO_FILE]
                        Path to the LSTINFO file containing additional information
  -f, --fasta_file [FASTA_FILE]
                        Path to the FASTA file containing gene sequences
  -s, --selection_condition [SELECTION_CONDITION]
                        Selection condition(s) for filtering gene pairs

Optional arguments:
  -d, --division_factor [DIVISION_FACTOR]
                        Number to divide alignment length for segment calculation (default: 20)
  -p, --processes [PROCESSES]
                        Number of processes to use for parallel execution (default: 1)

Input file formats:

Example input files can be found in the raw_data folder. They were obtained by annotating genomes with PanACoTA.

Genomes_csv contains names of bacterial genera, species, and strains, along with their RefSeq accessions.
Lst_info_file contains the sequential numbers of genomes in the format 4 letters, 4 digits, 5 digits, as well as how these sequential numbers correspond to the data in the Genomes_csv file.
Fasta_file contains target genes (e.g., histidine triad genes) from the analyzed genomes.
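
The identifier pattern described for Lst_info_file can be checked with a short regular expression. This is only a sketch: it assumes the three parts are separated by dots, which the description above does not state, and both the helper name and the example identifier are made up for illustration.

```python
import re

# Assumption: the parts are dot-separated, e.g. "STPN.0424.00001" (hypothetical ID)
GENOME_ID_PATTERN = re.compile(r"[A-Za-z]{4}\.\d{4}\.\d{5}")


def is_valid_genome_id(genome_id: str) -> bool:
    """True if genome_id is 4 letters, 4 digits, 5 digits, separated by dots."""
    return GENOME_ID_PATTERN.fullmatch(genome_id) is not None
```

If the real files use a different separator, only the pattern needs to change.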

Expected output:

If phase variations are detected, an output directory is created with the following content:

- CSV file with a list of genomes and their specific genes involved in phase variations
- directory with alignments corresponding to the gene pairs involved in phase variations
- directory with plots of the mismatch values for the gene pairs involved in phase variations

Examples

To test the code functionality and output format, run it on the test data:

python varhunter.py -g test_data/Genomes_names_Streptococci.csv -l test_data/Lst_info_Streptococci.lst -f test_data/Histidine_triads_Streptococci_nt.fasta -s "Streptococcus pneumoniae" -p 6

Troubleshooting

If the folders with gene pairs and their alignments already exist when the script is run again, these steps are skipped to save computation time. To make the code repeat all stages, delete these folders manually.
At the moment, the code works reliably on Linux systems; on other systems, issues with codon alignment may occur.
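
Deleting the cached folders before a full rerun can also be scripted. The helper below is a hypothetical sketch, not part of VarHunter: the folder names are not specified here, so they must be supplied by the caller after checking which directories the tool actually created.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def remove_cached_dirs(dirs: Iterable[str]) -> List[str]:
    """Delete each existing directory in dirs and return the removed paths."""
    removed = []
    for d in dirs:
        path = Path(d)
        # Skip anything that is missing or is not a directory
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed
```

Passing the paths explicitly keeps the deletion deliberate, since shutil.rmtree removes a directory and everything inside it.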
