SMBGC Annotation using Neural Networks Trained on Interpro Signatures
Tool for identifying biosynthetic gene clusters (BGCs) in genomic & metagenomic data
- Linux OS/Unix-like
- InterProScan > 5.52-86.0
- InterProScan cannot run on non-Linux operating systems. On those systems, precomputed InterProScan output must be provided in TSV or GFF3 format using "--ip-file", together with a GBK file as SEQUENCE.
- Bioconda:
conda create -n sanntis sanntis
conda activate sanntis
Tests require pytest-workflow.
pip install pytest-workflow
Verify installation using preprocessed InterProScan outputs (supported on macOS and Linux).
pytest --tag sanntis_with_preprocessed_files
Verify installation and ensure InterProScan is set up correctly (supported on Linux).
pytest --tag sanntis_full_dependencies
SanntiS can be executed using preprocessed InterProScan outputs along with a GenBank (GBK) file specifying the coding sequences (CDSs). This makes it possible to run SanntiS on systems where InterProScan itself is not available.
conda activate sanntis
sanntis --ip-file test/files/BGC0001472.fna.prodigal.faa.gff3 test/files/
conda deactivate
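Before passing a precomputed file to --ip-file, it can be useful to check which signatures it reports per protein. The following is a minimal Python sketch, assuming the standard InterProScan TSV layout (tab-separated, protein accession in column 1, signature accession in column 5). The function name `signatures_per_protein` and the two example rows are hypothetical and are not part of SanntiS.

```python
import csv
import io
from collections import defaultdict


def signatures_per_protein(handle):
    """Group signature accessions by protein ID from an InterProScan TSV stream."""
    hits = defaultdict(set)
    # QUOTE_NONE: treat quote characters in descriptions as plain text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines and rows too short to hold a signature accession.
        if len(row) < 5:
            continue
        protein_id, signature = row[0], row[4]
        hits[protein_id].add(signature)
    return dict(hits)


# Two made-up rows in the assumed column order, for illustration only.
example = (
    "prot_1\tmd5a\t300\tPfam\tPF00109\tKS N-term\t10\t250\t1e-50\tT\t01-01-2024\n"
    "prot_1\tmd5a\t300\tPfam\tPF02801\tKS C-term\t255\t295\t1e-20\tT\t01-01-2024\n"
)
result = signatures_per_protein(io.StringIO(example))
```

For a real file, the same function can be called on an open file handle instead of the in-memory example.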
Additionally, the --ip-file option can be combined with a protein FASTA file as SEQUENCE, provided its headers follow Prodigal's convention. In this case, the --is_protein flag must be included to indicate that the sequence file is a protein FASTA.
conda activate sanntis
sanntis --is_protein --ip-file test/files/BGC0001472.fna.prodigal.faa.gff3 test/files/BGC0001472.fna.prodigal.faa
conda deactivate
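To check whether a protein FASTA follows Prodigal's header convention, a header can be split into its parts. This is a minimal Python sketch, assuming the usual Prodigal layout `>contig_N # start # end # strand # key=value;...`. The function name `parse_prodigal_header` and the example header are hypothetical; how SanntiS itself reads these headers is not described here.

```python
def parse_prodigal_header(header):
    """Split a Prodigal-style FASTA header into its components."""
    fields = header.strip().lstrip(">").split(" # ")
    if len(fields) != 5:
        raise ValueError(f"not a Prodigal-style header: {header!r}")
    protein_id, start, end, strand, attrs = fields
    # Prodigal names each protein <contig>_<gene index>.
    contig, _, gene_index = protein_id.rpartition("_")
    attributes = dict(
        item.split("=", 1) for item in attrs.split(";") if "=" in item
    )
    return {
        "contig": contig,
        "gene_index": int(gene_index),
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
        "attributes": attributes,
    }


# Made-up header for illustration only.
header = ">DS999642_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge"
parsed = parse_prodigal_header(header)
```

A header that does not contain the four ` # ` separators raises `ValueError`, which gives a quick signal that the file was not produced in Prodigal's format.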
bash ./ --help [OPTIONS] ARGUMENTS
docker -it --entrypoint bash -v <path to SanntiS/docker>/data/:/opt/interproscan
sanntis --help
GFF3 format file
The fields in this header are as follows:
seqname: SeqID of contig, as in prodigal output.
source: sanntis version.
feature: Feature type name, i.e. CLUSTER, CLUSTER_border, CDS.
start: Start position of feature
end: End position of feature
score: empty
strand: empty
frame: empty
ID: ordinal ID for the cluster, beginning with 1.
nearest_MiBIG: MIBiG accession of the BGC nearest to the cluster in MIBiG space, as measured by the Dice dissimilarity coefficient.
nearest_MiBIG_class: BGC class of nearest_MiBIG.
nearest_MiBIG_diceDistance: Dice dissimilarity coefficient between the cluster (ID) and nearest_MiBIG.
score: Post-processing probability output.
partial: Indicates whether a CLUSTER is at the edge of the contig. The first and second digits represent the 5' and 3' ends, respectively, as in Prodigal's `partial`. A "0" means the cluster is not at that edge, whereas a "1" means it is at that edge (i.e. a partial cluster).
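The nearest_MiBIG_diceDistance field reports a Dice dissimilarity. As a reminder of how that coefficient behaves, here is a minimal Python sketch of the standard definition for two sets, 1 - 2|A ∩ B| / (|A| + |B|). Which features SanntiS actually compares is not stated in this document; the use of sets of InterPro accessions below is an assumption for illustration, and `dice_dissimilarity` is a hypothetical helper.

```python
def dice_dissimilarity(a, b):
    """Dice dissimilarity between two sets: 1 - 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical annotation sets, for illustration only.
cluster = {"IPR000001", "IPR000002", "IPR000003"}
reference = {"IPR000002", "IPR000003", "IPR000004"}
distance = dice_dissimilarity(cluster, reference)
```

The value ranges from 0 (identical sets) to 1 (no shared elements), so a lower nearest_MiBIG_diceDistance indicates a closer match.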
##gff-version 3
DS999642 SanntiSv0.9.0 CLUSTER 1 136970 . . . ID=DS999642_sanntis_1;nearest_MiBIG=BGC0001397;nearest_MiBIG_class=NRP Polyketide;nearest_MiBIG_diceDistance=0.561;partial=10
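A record like the one above can be read with a few lines of Python. This is a minimal sketch, assuming the nine columns are tab-separated as in the GFF3 specification and that attributes are `key=value` pairs joined by `;`. The function name `parse_sanntis_gff3_line` and the `partial_5prime`/`partial_3prime` keys are hypothetical and not part of SanntiS.

```python
def parse_sanntis_gff3_line(line):
    """Parse one GFF3 feature line into a dict, decoding the partial flag."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    record = {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }
    # First digit of partial is the 5' end, second digit is the 3' end.
    partial = attributes.get("partial", "")
    if len(partial) == 2:
        record["partial_5prime"] = partial[0] == "1"
        record["partial_3prime"] = partial[1] == "1"
    return record


# The example record from this section, with tabs between columns.
line = "\t".join([
    "DS999642", "SanntiSv0.9.0", "CLUSTER", "1", "136970", ".", ".", ".",
    "ID=DS999642_sanntis_1;nearest_MiBIG=BGC0001397;"
    "nearest_MiBIG_class=NRP Polyketide;"
    "nearest_MiBIG_diceDistance=0.561;partial=10",
])
record = parse_sanntis_gff3_line(line)
```

For this record, `partial=10` decodes to a cluster that touches the 5' edge of the contig but not the 3' edge, which is consistent with its start position of 1.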
SanntiS writes its results as a GFF3 file for broad compatibility with downstream analysis tools. It can also generate output compatible with antiSMASH, a widely used tool for BGC analysis.
SanntiS has an --antismash_output option. This option creates a JSON file formatted according to the antiSMASH specifications.
sanntis --antismash_output True test/files/BGC0001472.fna
Executing the command above produces a file with the suffix antismash.json, which can be used in antiSMASH for further analysis. Specifically, this file can be uploaded to the antiSMASH web server under 'Data input' > 'Upload extra annotations', so that the SanntiS annotations are included alongside the antiSMASH results.
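Before uploading, it may be worth confirming that the generated file exists and is well-formed JSON. This is a minimal Python sketch that only relies on the antismash.json suffix mentioned above. The output directory is passed in by the caller because where SanntiS writes the file is not specified here, the internal structure of the JSON is not inspected, and `find_antismash_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_antismash_json(output_dir):
    """Return files in output_dir ending in 'antismash.json' that parse as JSON."""
    valid = []
    for path in sorted(Path(output_dir).glob("*antismash.json")):
        try:
            with path.open() as handle:
                json.load(handle)
        except (OSError, json.JSONDecodeError):
            # Unreadable or malformed files are skipped rather than reported.
            continue
        valid.append(path)
    return valid
```

Calling `find_antismash_json(".")` lists the candidate files in the current directory; an empty list suggests that the file was written elsewhere or is not valid JSON.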
If you use SanntiS, please cite the following publication:
Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS
Santiago Sanchez, Joel D. Rogers, Alexander B. Rogers, Maaly Nassar, Johanna McEntyre, Martin Welch, Florian Hollfelder, Robert D. Finn
bioRxiv 2023.05.23.540769; doi: