Support user-provided feature regions (#259)

* introduce user-provided regions * implement creation & overlap filters * amend INSDC/GFF3 inferences * fix source value error in GFF3 output * actually add user-provided cdss in main * fix CDS start & strand * add tests * implement GFF3 region support * add GFF3 tests * refactor code in cds module * bump version to v1.9.0-beta * update readme * polish test cases * add overlapping CDS test case * distinguish de novo from user-provided ORF aa seqs * refactor tests for better readbility * amend readme [skip ci]
oschwengers · Nov 17, 2023 · 317107c · 317107c
1 parent a99ae5b
commit 317107c
Show file tree

Hide file tree

Showing 15 changed files with 446 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -226,9 +226,13 @@ NODE_3 | p2 | `p`  |  `c` | `pXYZ2`
 NODE_4 | special-contig-name-xyz |  `-` | -
 NODE_5 | `` |  `-` | -
 
-#### User provided protein sequences
+#### User-provided regions
 
-Bakta accepts user provided trusted protein sequences via `--proteins` in either GenBank (CDS features) or Fasta format. Using the Fasta format, each reference sequence can be provided in a short or long format:
+Bakta accepts pre-annotated, user-provided feature regions via `--regions` in either GFF3 or GenBank format. These regions supersede all de novo-predicted regions, but are equally subject to the internal functional annotation process. Currently, only `CDS` are supported. A maximum overlap with de novo-predicted CDS of 30 bp is allowed. If you would like to provide custom functional annotations, you can provide these via `--proteins` which is described in the following section.
+
+#### User-provided protein sequences
+
+Bakta accepts user-provided trusted protein sequences via `--proteins` in either GenBank (CDS features) or Fasta format which are used in the functional annotation process. Using the Fasta format, each reference sequence can be provided in a short or long format:
 
 ```bash
 # short:
@@ -324,7 +328,7 @@ Exemplary annotation result files for several genomes (mostly ESKAPE species) ar
 usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT]
              [--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID]
              [--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4}] [--gram {+,-,?}] [--locus LOCUS]
-             [--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS] [--meta]
+             [--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS] [--meta] [--regions REGIONS]
              [--skip-trna] [--skip-tmrna] [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region]
              [--skip-crispr] [--skip-cds] [--skip-pseudo] [--skip-sorf] [--skip-gap] [--skip-ori] [--skip-plot]
              [--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]
@@ -368,6 +372,7 @@ Annotation:
   --compliant           Force Genbank/ENA/DDJB compliance
   --proteins PROTEINS   Fasta file of trusted protein sequences for CDS annotation
   --meta                Run in metagenome mode. This only affects CDS prediction.
+  --regions REGIONS     Path to pre-annotated regions in GFF3 or Genbank format (regions only, no functional annotations).
 
 Workflow:
   --skip-trna           Skip tRNA detection & annotation
@@ -422,7 +427,7 @@ ncRNA (cis-regulatory) region types:
 
 ### Coding sequences
 
-The structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF < 30 aa.
+The structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF < 30 aa. In addition, superseding regions of pre-predicted CDS can be provided via `--regions`.
 
 To rapidly identify known protein sequences with exact sequence matches and to conduct a comprehensive annotations, Bakta utilizes a compact read-only SQLite database comprising protein sequence digests and pre-assigned annotations for millions of known protein sequences and clusters.
 
@@ -435,22 +440,23 @@ Conceptual terms:
 
 **CDS**:
 
-1. Prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)
+1. De novo-prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)
 2. Discard spurious CDS via AntiFam
 3. Detect translational exceptions (selenocysteines)
-4. Detection of UPSs via MD5 digests and lookup of related IPS and PCS
-5. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
-6. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
-7. Execution of expert systems:
+4. Import of superseding user-provided CDS regions (optional)
+5. Detection of UPSs via MD5 digests and lookup of related IPS and PCS
+6. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
+7. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
+8. Execution of expert systems:
    - AMR: AMRFinderPlus
    - Expert proteins: NCBI BlastRules, VFDB
    - User proteins (optionally via `--proteins <Fasta/GenBank>`)
-8. Prediction of signal peptides (optionally via `--gram <+/->`)
-9. Detection of pseudogenes:
+9. Prediction of signal peptides (optionally via `--gram <+/->`)
+10. Detection of pseudogenes:
    1. Search for reference PCSs using `hypothetical` CDS as seed sequences
    2. Translated alignment (blastx) of reference PCSs against up-/downstream-elongated CDS regions
    3. Analysis of translated alignments and detection of pseudogenization causes & effects
-10. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
+11. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
 
 CDS without IPS or PSC hits as well as those without gene symbols or product descriptions different from `hypothetical` will be marked as `hypothetical`.
 

diff --git a/bakta/__init__.py b/bakta/__init__.py
@@ -1,2 +1,2 @@
-__version__ = '1.8.2'
+__version__ = '1.9.0-beta'
 __db_schema_version__ = 5
diff --git a/bakta/config.py b/bakta/config.py
@@ -55,6 +55,7 @@
 compliant = None
 user_proteins = None
 meta = None
+regions = None
 
 # workflow configuration
 skip_trna = None
@@ -160,7 +161,7 @@ def setup(args):
         taxon = None
 
     # annotation configurations
-    global complete, prodigal_tf, translation_table, keep_contig_headers, locus, locus_tag, gram, replicons, compliant, user_proteins, meta
+    global complete, prodigal_tf, translation_table, keep_contig_headers, locus, locus_tag, gram, replicons, compliant, user_proteins, meta, regions
     complete = args.complete
     log.info('complete=%s', complete)
     prodigal_tf = args.prodigal_tf
@@ -232,6 +233,19 @@ def setup(args):
             sys.exit(f'ERROR: replicon table file ({replicons}) not valid!')
     log.info('replicon-table=%s', replicons)
     user_proteins = check_user_proteins(args)
+    regions = args.regions
+    if(regions is not None):
+        try:
+            if(regions == ''):
+                raise ValueError('File path argument must be non-empty')
+            regions_path = Path(args.regions).resolve()
+            check_readability('regions', regions_path)
+            check_content_size('regions', regions_path)
+            regions = regions_path
+        except:
+            log.error('provided regions file not valid! path=%s', regions)
+            sys.exit(f'ERROR: regions file ({regions}) not valid!')
+    log.info('regions=%s', regions)
 
 
     # workflow configurations

diff --git a/bakta/constants.py b/bakta/constants.py
@@ -19,14 +19,16 @@
 
 
 ############################################################################
-# Protein identification settings
+# CDS settings
 ############################################################################
 MIN_PSCC_IDENTITY = 0.5  # min protein identity for PSC detection
 MIN_PSC_COVERAGE = 0.8  # min protein coverage for PSC detection
 MIN_PSC_IDENTITY = 0.9  # min protein identity for PSC detection
 MIN_SORF_COVERAGE = 0.9  # min sORF coverage for PSC detection
 MIN_SORF_IDENTITY = 0.9  # min sORF identity for PSC detection
 HYPOTHETICAL_PROTEIN = 'hypothetical protein'  # hypothetical protein product description
+CDS_MAX_OVERLAPS = 30  # max overlap [bp] allowed for user-provided/de novo-predicted CDS overlaps
+CDS_SOURCE_USER = 'user-provided'
 
 
 ############################################################################

diff --git a/bakta/features/annotation.py b/bakta/features/annotation.py
@@ -169,8 +169,12 @@ def detect_feature_overlaps(genome: dict):
         crispr_arrays = contig_crispr_arrays[crispr_array['contig']]
         crispr_arrays.append(crispr_array)
     contig_cdss = {k['id']: [] for k in genome['contigs']}
+    contig_cdss_user_provided = {k['id']: [] for k in genome['contigs']}
     for cds in genome['features'].get(bc.FEATURE_CDS, []):
-        cdss = contig_cdss[cds['contig']]
+        if(cds.get('source', None) == bc.CDS_SOURCE_USER):
+            cdss = contig_cdss_user_provided[cds['contig']]
+        else:
+            cdss = contig_cdss[cds['contig']]
         cdss.append(cds)
     contig_sorfs = {k['id']: [] for k in genome['contigs']}
     for sorf in genome['features'].get(bc.FEATURE_SORF, []):
@@ -217,7 +221,7 @@ def detect_feature_overlaps(genome: dict):
                             ncRNA_region['product'], ncRNA_region['start'], ncRNA_region['stop'], ncRNA_region_overlap['product'], ncRNA_region_overlap['start'], ncRNA_region_overlap['stop'], overlap, ncRNA_region['contig'], ncRNA_region['score'], ncRNA_region_overlap['score']
                         )
 
-        # mark CDS overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs
+        # mark de novo-predicted CDS overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs and user-provided CDS
         for cds in contig_cdss[contig['id']]:
             # tmRNA overlaps
             for tmRNA in contig_tm_rnas[contig['id']]:
@@ -279,6 +283,23 @@ def detect_feature_overlaps(genome: dict):
                         "overlap: CDS (%s/%s) [%i, %i] overlapping CRISPR [%i, %i], %s, contig=%s",
                         cds.get('gene', '-'), cds.get('product', '-'), cds['start'], cds['stop'], crispr['start'], crispr['stop'], overlap, cds['contig']
                     )
+            # user-provided CDS overlaps
+            for cds_user_provided in contig_cdss_user_provided[contig['id']]:
+                if(cds['stop'] < cds_user_provided['start'] or cds['start'] > cds_user_provided['stop']):
+                    continue
+                else:  # overlap -> remove cds
+                    overlap = min(cds['stop'], cds_user_provided['stop']) - max(cds['start'], cds_user_provided['start']) + 1
+                    if(overlap > bc.CDS_MAX_OVERLAPS):
+                        overlap = f"[{max(cds['start'], cds_user_provided['start'])},{min(cds['stop'], cds_user_provided['stop'])}]"
+                        cds['discarded'] = {
+                            'type': bc.DISCARD_TYPE_OVERLAP,
+                            'feature_type': bc.FEATURE_CDS,
+                            'description': f'overlaps {bc.FEATURE_CDS} at {overlap}'
+                        }
+                        log.info(
+                            "overlap: de novo CDS (%s/%s) [%i, %i] overlapping user-provided CDS [%i, %i], %s, contig=%s",
+                            cds.get('gene', '-'), cds.get('product', '-'), cds['start'], cds['stop'], cds_user_provided['start'], cds_user_provided['stop'], overlap, cds['contig']
+                        )
 
         # remove sORF overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs, inframe CDSs, shorter inframe sORFs
         for sorf in contig_sorfs[contig['id']]:
@@ -342,7 +363,21 @@ def detect_feature_overlaps(genome: dict):
                         "overlap: sORF (%s/%s) [%i, %i] overlapping CRISPR [%i, %i], %s, contig=%s",
                         sorf.get('gene', '-'), sorf.get('product', '-'), sorf['start'], sorf['stop'], crispr['start'], crispr['stop'], overlap, sorf['contig']
                     )
-            # CDS overlaps (skipped as most overlaps are filtered in sORF detection)
+            # user-provided CDS overlaps
+            for cds_user_provided in contig_cdss_user_provided[contig['id']]:
+                if(sorf['stop'] < cds_user_provided['start'] or sorf['start'] > cds_user_provided['stop']):
+                    continue
+                else:  # overlap -> remove sorf
+                    overlap = f"[{max(sorf['start'], cds_user_provided['start'])},{min(sorf['stop'], cds_user_provided['stop'])}]"
+                    sorf['discarded'] = {
+                        'type': bc.DISCARD_TYPE_OVERLAP,
+                        'feature_type': bc.FEATURE_CDS,
+                        'description': f'overlaps {bc.FEATURE_CDS} at {overlap}'
+                    }
+                    log.info(
+                        "overlap: sORF (%s/%s) [%i, %i] overlapping user-provided CDS [%i, %i], %s, contig=%s",
+                        sorf.get('gene', '-'), sorf.get('product', '-'), sorf['start'], sorf['stop'], cds_user_provided['start'], cds_user_provided['stop'], overlap, sorf['contig']
+                    )
 
             # sORF overlaps
             for overlap_sorf in contig_sorfs[contig['id']]: