A Tool for Calculating Bacterial Gene Co-occurrence Without Phylogenetic Inference
Description
Installation
Basic Usage
Workflow
Tutorial With Sample Data
Jan 25th 2024: A link to the bioRxiv software announcement paper will be provided soon.
An example of GeneCoOccurrence usage is found in this Nature Microbiology study of unique genomic islands in the current pandemic strain of Vibrio cholerae.
GeneCoOccurrence enables investigation of the possible functional pairing of co-occurring bacterial genes on genetic elements that lack informative phylogeny. This uniquely allows investigation of horizontally acquired regions of DNA in genomic islands, mobile genetic elements, etc. The software works by calculating the frequency at which a pair of genes are jointly present within individual genomes across a given set of sequenced bacteria. The output includes co-occurrence scores, heatmaps, and graphical networks to provide context to the co-occurrence of gene pairs.
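To make the idea of joint presence concrete, the following is a minimal sketch, not the GeneCoOccurrence implementation. It counts, for each gene pair, the number of genomes that carry both genes. The function name `joint_presence_counts`, the genome labels, and the gene names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def joint_presence_counts(genomes):
    """Count, for each gene pair, the genomes in which both genes are present.

    `genomes` maps a genome label to the set of genes detected in that genome.
    """
    counts = {}
    for genes in genomes.values():
        # Sorting gives every pair a stable (alphabetical) key.
        for pair in combinations(sorted(genes), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


# Hypothetical toy data: three genomes and three genes.
genomes = {
    "genome_1": {"geneA", "geneB", "geneC"},
    "genome_2": {"geneA", "geneB"},
    "genome_3": {"geneC"},
}

counts = joint_presence_counts(genomes)
for pair, n in sorted(counts.items()):
    # Report the raw count and the fraction of genomes carrying both genes.
    print(pair, n, round(n / len(genomes), 2))
```

Raw joint counts like these are only a starting point; the tool itself goes on to compute correlation-based scores as outlined in the workflow.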
See Tutorial With Sample Data below for sample input and step-by-step examples.
GeneCoOccurrence is written in Python 3 and is compatible with Linux and macOS.
We recommend creating an isolated environment using Conda, followed by installation using the Python package manager pip.
- Create a Conda environment named 'gco' (i.e. GeneCoOccurrence)
conda create --name gco python=3.11
- Enter the isolated 'gco' environment
conda activate gco
- Install GeneCoOccurrence with pip from our package on TestPyPI
python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ genecooccurrence
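After installation, one way to confirm that the package landed in the active environment is to ask Python for the installed distribution's version. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `genecooccurrence` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("genecooccurrence"))
```

If this prints None, check that the 'gco' environment is activated before re-running the pip command.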
gco -i ./VSPI_genes_BLAST_results.xml -o ./VSPI -c ./prot_ID_to_common_name.csv
- -i <Input File>
Required: BLAST results for your genes of interest, saved as a single .xml file (an output format available from both the BLAST tool on the NCBI website and the stand-alone command-line tool).
OR
Any binary presence/absence matrix in a comma-separated values (.csv) file. It should be UTF-8 encoded, comma-separated, and have a '.csv' suffix.
- -o <Output Directory>
Optional: If not provided, defaults to the current working directory.
- -c <Common Name>
Optional: A .csv file that allows conversion of BLAST query protein IDs to common gene names of your choosing.
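Before passing a custom matrix to `-i`, it can help to confirm that every data cell is strictly 0 or 1. The sketch below is an illustration only. It assumes a layout with a header row of gene names and a first column of genome labels, which is an assumption rather than a documented requirement; consult the tutorial file unenriched_gene_matrix.csv for the layout the tool actually expects. The function name `check_binary_matrix` and the sample values are hypothetical.

```python
import csv
import io


def check_binary_matrix(csv_text):
    """Return (row label, column name, value) for every cell that is not 0 or 1.

    Assumes (hypothetically) a header row of gene names and a first column
    of genome labels; adjust if your matrix is laid out differently.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    for row in body:
        label, cells = row[0], row[1:]
        for column, value in zip(header[1:], cells):
            if value.strip() not in {"0", "1"}:
                problems.append((label, column, value))
    return problems


# Hypothetical example with one invalid cell in the last row.
example = "genome,geneA,geneB\ng1,1,0\ng2,1,1\ng3,0,present\n"
print(check_binary_matrix(example))
```

For a real file, read it with `open(path, encoding="utf-8", newline="")` and pass the text to the function; an empty list means every data cell is binary.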
Workflow of GeneCoOccurrence. Output folders are shown in grey, the input file in blue, intermediate output files in red, and final output files in green. A. User input is either a presence/absence matrix OR BLAST results. B. A presence/absence matrix is generated if BLAST results were chosen as input. C. The co-occurrence for all genes of interest (GOIs) i to j is summed and fed into a Pearson correlation followed by a partial correlation correction, which results in a co-occurrence score. D. Output includes a co-occurrence heatmap of all genes i to j, maximum related subnetwork visuals, and a co-occurrence table.
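Step C can be illustrated with a small worked example. The sketch below computes Pearson correlations between presence/absence vectors and then applies the textbook first-order partial correlation formula, which removes the influence of a third gene from a pairwise correlation. This is an assumption made for illustration: the exact correction GeneCoOccurrence applies is not described here and may differ. The vectors are hypothetical toy data over eight genomes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical presence (1) / absence (0) of three genes across eight genomes.
gene_a = [1, 1, 1, 0, 1, 0, 0, 0]
gene_b = [1, 1, 0, 1, 0, 1, 0, 0]
gene_c = [1, 1, 1, 1, 0, 0, 0, 0]

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
r_bc = pearson(gene_b, gene_c)

print("raw r(a, b):", round(r_ab, 3))
print("partial r(a, b | c):", round(partial_correlation(r_ab, r_ac, r_bc), 3))
```

In this toy data, genes a and b are each correlated with gene c but uncorrelated with one another, so controlling for c shifts their score away from zero. This shows how a partial correlation can change the picture given by raw pairwise correlations.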
1. Using BLAST Results: You may wish to use BLAST results of genes of interest to calculate gene pair co-occurrence, most likely to infer functional relatedness.
2. Using A Custom Presence/Absence Matrix: You may wish to use a custom presence/absence matrix, for example to calculate the co-occurrence of antibiotic resistance genes within metagenomic data.
- First install GeneCoOccurrence as described above. Ensure the gco environment is activated as in step 2 of the installation.
- Right-click this link and choose 'Save As' to save a zipped directory containing the example data and output.
- Navigate to and unzip the previous directory, which should have this structure:
└── tutorial
├── manure_metagenomic_study
│ ├── unenriched_gene
│ └── unenriched_gene_matrix.csv
├── README.txt
└── vibrio_VSPI_study
├── prot_ID_to_common_name.csv
├── VSPI
└── VSPI_genes_BLAST_results.xml
- Enter the vibrio_VSPI_study directory, which contains the two input files: VSPI_genes_BLAST_results.xml and prot_ID_to_common_name.csv
- Run the command:
gco -i ./VSPI_genes_BLAST_results.xml -o ./VSPI -c ./prot_ID_to_common_name.csv
- Note this command will overwrite the existing output directory VSPI. See workflow above to see the output files available in directory VSPI.
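Because the run overwrites the existing output directory, you may want to keep a copy of the shipped example output first. The following is a minimal sketch using only the standard library; the helper name `backup_directory` and the `_backup` suffix are hypothetical choices. The demonstration works in a temporary directory so it does not touch your tutorial files.

```python
import shutil
import tempfile
from pathlib import Path


def backup_directory(output_dir):
    """Copy output_dir to a sibling '<name>_backup' directory, if it exists.

    Returns the backup path, or None when there is nothing to back up.
    """
    source = Path(output_dir)
    if not source.is_dir():
        return None
    target = source.with_name(source.name + "_backup")
    # dirs_exist_ok (Python 3.8+) allows refreshing an earlier backup.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Demonstration in a throwaway directory with a stand-in output folder.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "VSPI"
    out.mkdir()
    (out / "example.txt").write_text("previous run")
    backup = backup_directory(out)
    print(backup.name, sorted(p.name for p in backup.iterdir()))
```

In the tutorial, calling `backup_directory("./VSPI")` from inside vibrio_VSPI_study before running gco would preserve the original example output alongside the new results.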
Helpful Information for Generating Your Own Compatible BLAST Data
It is important that BLAST matches have enough sequence similarity to the gene of interest (GOI) to reasonably infer homology. For this reason, the BLAST program (e.g., BLASTp or BLASTn) and E-value cutoff should be carefully chosen given the context of your organism(s) and possible mobile genetic elements (MGEs) of interest. Our study into functionally related gene pairs in Vibrio cholerae used BLASTp with an E-value cutoff of 10^-4 against the NCBI non-redundant protein (nr) database limited to taxid:2 (bacteria). Additionally, we note that the online BLAST tool provided by NCBI limits the total number of hits returned. This means that hits meeting the search criteria could be arbitrarily dropped, reducing the input available to GeneCoOccurrence to make predictions. In these cases, users should consider running BLAST searches locally. The BLAST search should have an output format of ‘5’ (single-file .xml).
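To inspect which hits in a BLAST XML file (output format 5) fall under a chosen E-value cutoff, the file can be read with Python's standard XML parser. The sketch below is illustrative and does not reproduce how GeneCoOccurrence processes its input. The element names (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_accession`, `Hsp_evalue`) follow the BLAST XML format, while the function name `hits_below_evalue`, the trimmed inline XML, and the hit accessions are hypothetical.

```python
import xml.etree.ElementTree as ET


def hits_below_evalue(xml_text, max_evalue=1e-4):
    """Map each query to the hit accessions whose best HSP E-value passes the cutoff."""
    root = ET.fromstring(xml_text)
    kept = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        accessions = []
        for hit in iteration.iter("Hit"):
            evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
            # Keep the hit if its best (smallest) E-value meets the cutoff.
            if evalues and min(evalues) <= max_evalue:
                accessions.append(hit.findtext("Hit_accession"))
        kept[query] = accessions
    return kept


# Heavily trimmed, hypothetical BLAST XML with one strong and one weak hit.
example_xml = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>BAF33440</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_accession>HYPO_0001</Hit_accession>
          <Hit_hsps><Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_accession>HYPO_0002</Hit_accession>
          <Hit_hsps><Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

print(hits_below_evalue(example_xml))
```

For a real results file, pass the file's text to the function, or swap `ET.fromstring` for `ET.parse(path).getroot()` when the file is large.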
Helpful Information for Creating a Common Name .csv
We offer the ability to convert less helpful BLAST query gene IDs (e.g. BAF33440) to common gene names (e.g. KfrC). These common gene names are subsequently used in intermediate and final output files (including visualizations). To enable this feature, use the ‘-c’ flag (e.g. -c gene_name_conversions.csv) pointing to a comma-separated .csv file. This file should be formatted with query gene IDs in the first column and corresponding common gene names on an equivalent row in the second column.
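The two-column layout described above can be checked with a short script before running gco. This sketch assumes the file has no header row, which is an assumption; compare against the tutorial file prot_ID_to_common_name.csv. The helper names `load_common_names` and `rename` and the second example row are hypothetical; the first row reuses the BAF33440 to KfrC example from the text.

```python
import csv
import io


def load_common_names(csv_text):
    """Build a {query gene ID: common name} mapping from two-column CSV text."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or incomplete rows rather than failing on them.
        if len(row) >= 2 and row[0].strip():
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def rename(ids, mapping):
    """Replace each ID with its common name, leaving unmapped IDs unchanged."""
    return [mapping.get(gene_id, gene_id) for gene_id in ids]


# First row is the example from the text; the second row is hypothetical.
example = "BAF33440,KfrC\nHYPO_ID_2,HypoGene\n"
names = load_common_names(example)
print(rename(["BAF33440", "UNMAPPED_ID"], names))
```

Leaving unmapped IDs unchanged is a design choice for this sketch; it makes missing rows in the conversion file easy to spot in the output.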
- First install GeneCoOccurrence as described above. Ensure the gco environment is activated as in step 2 of the installation.
- Right-click this link and choose 'Save As' to save a zipped directory containing the example data and output.
- Navigate to and unzip the previous directory, which should have this structure:
└── tutorial
├── manure_metagenomic_study
│ ├── unenriched_gene
│ └── unenriched_gene_matrix.csv
├── README.txt
└── vibrio_VSPI_study
├── prot_ID_to_common_name.csv
├── VSPI
└── VSPI_genes_BLAST_results.xml
- Enter the manure_metagenomic_study directory, which contains the single input file: unenriched_gene_matrix.csv
- Run the command:
gco -i ./unenriched_gene_matrix.csv -o ./unenriched_gene
- Note this command will overwrite the existing output directory unenriched_gene. See workflow above to see the output files available in directory unenriched_gene.
Copyright (C) 2018-2024 Clinton A. Elg. Please contact via GitHub.
This project is licensed under the GNU General Public License, Version 2. Please see the LICENSE.md link for details.