Skip to content

Commit

Permalink
Support user-provided feature regions (#259)
Browse files Browse the repository at this point in the history
* introduce user-provided regions
* implement creation & overlap filters
* amend INSDC/GFF3 inferences
* fix source value error in GFF3 output
* actually add user-provided cdss in main
* fix CDS start & strand
* add tests
* implement GFF3 region support
* add GFF3 tests
* refactor code in cds module
* bump version to v1.9.0-beta
* update readme
* polish test cases
* add overlapping CDS test case
* distinguish de novo from user-provided ORF aa seqs
* refactor tests for better readbility
* amend readme [skip ci]
  • Loading branch information
oschwengers authored Nov 17, 2023
1 parent a99ae5b commit 317107c
Show file tree
Hide file tree
Showing 15 changed files with 446 additions and 37 deletions.
30 changes: 18 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -226,9 +226,13 @@ NODE_3 | p2 | `p` | `c` | `pXYZ2`
NODE_4 | special-contig-name-xyz | `-` | -
NODE_5 | `` | `-` | -

#### User provided protein sequences
#### User-provided regions

Bakta accepts user provided trusted protein sequences via `--proteins` in either GenBank (CDS features) or Fasta format. Using the Fasta format, each reference sequence can be provided in a short or long format:
Bakta accepts pre-annotated, user-provided feature regions via `--regions` in either GFF3 or GenBank format. These regions supersede all de novo-predicted regions, but are equally subject to the internal functional annotation process. Currently, only `CDS` are supported. A maximum overlap with de novo-predicted CDS of 30 bp is allowed. If you would like to provide custom functional annotations, you can provide these via `--proteins` which is described in the following section.

#### User-provided protein sequences

Bakta accepts user-provided trusted protein sequences via `--proteins` in either GenBank (CDS features) or Fasta format which are used in the functional annotation process. Using the Fasta format, each reference sequence can be provided in a short or long format:

```bash
# short:
Expand Down Expand Up @@ -324,7 +328,7 @@ Exemplary annotation result files for several genomes (mostly ESKAPE species) ar
usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT]
[--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID]
[--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4}] [--gram {+,-,?}] [--locus LOCUS]
[--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS] [--meta]
[--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS] [--meta] [--regions REGIONS]
[--skip-trna] [--skip-tmrna] [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region]
[--skip-crispr] [--skip-cds] [--skip-pseudo] [--skip-sorf] [--skip-gap] [--skip-ori] [--skip-plot]
[--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]
Expand Down Expand Up @@ -368,6 +372,7 @@ Annotation:
--compliant Force Genbank/ENA/DDJB compliance
--proteins PROTEINS Fasta file of trusted protein sequences for CDS annotation
--meta Run in metagenome mode. This only affects CDS prediction.
--regions REGIONS Path to pre-annotated regions in GFF3 or Genbank format (regions only, no functional annotations).

Workflow:
--skip-trna Skip tRNA detection & annotation
Expand Down Expand Up @@ -422,7 +427,7 @@ ncRNA (cis-regulatory) region types:
### Coding sequences
The structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF < 30 aa.
The structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF < 30 aa. In addition, superseding regions of pre-predicted CDS can be provided via `--regions`.
To rapidly identify known protein sequences with exact sequence matches and to conduct a comprehensive annotations, Bakta utilizes a compact read-only SQLite database comprising protein sequence digests and pre-assigned annotations for millions of known protein sequences and clusters.
Expand All @@ -435,22 +440,23 @@ Conceptual terms:

**CDS**:

1. Prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)
1. De novo-prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)
2. Discard spurious CDS via AntiFam
3. Detect translational exceptions (selenocysteines)
4. Detection of UPSs via MD5 digests and lookup of related IPS and PCS
5. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
6. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
7. Execution of expert systems:
4. Import of superseding user-provided CDS regions (optional)
5. Detection of UPSs via MD5 digests and lookup of related IPS and PCS
6. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
7. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
8. Execution of expert systems:
- AMR: AMRFinderPlus
- Expert proteins: NCBI BlastRules, VFDB
- User proteins (optionally via `--proteins <Fasta/GenBank>`)
8. Prediction of signal peptides (optionally via `--gram <+/->`)
9. Detection of pseudogenes:
9. Prediction of signal peptides (optionally via `--gram <+/->`)
10. Detection of pseudogenes:
1. Search for reference PCSs using `hypothetical` CDS as seed sequences
2. Translated alignment (blastx) of reference PCSs against up-/downstream-elongated CDS regions
3. Analysis of translated alignments and detection of pseudogenization causes & effects
10. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
11. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
CDS without IPS or PSC hits as well as those without gene symbols or product descriptions different from `hypothetical` will be marked as `hypothetical`.
Expand Down
2 changes: 1 addition & 1 deletion bakta/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
__version__ = '1.8.2'
__version__ = '1.9.0-beta'
__db_schema_version__ = 5
16 changes: 15 additions & 1 deletion bakta/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@
compliant = None
user_proteins = None
meta = None
regions = None

# workflow configuration
skip_trna = None
Expand Down Expand Up @@ -160,7 +161,7 @@ def setup(args):
taxon = None

# annotation configurations
global complete, prodigal_tf, translation_table, keep_contig_headers, locus, locus_tag, gram, replicons, compliant, user_proteins, meta
global complete, prodigal_tf, translation_table, keep_contig_headers, locus, locus_tag, gram, replicons, compliant, user_proteins, meta, regions
complete = args.complete
log.info('complete=%s', complete)
prodigal_tf = args.prodigal_tf
Expand Down Expand Up @@ -232,6 +233,19 @@ def setup(args):
sys.exit(f'ERROR: replicon table file ({replicons}) not valid!')
log.info('replicon-table=%s', replicons)
user_proteins = check_user_proteins(args)
regions = args.regions
if(regions is not None):
try:
if(regions == ''):
raise ValueError('File path argument must be non-empty')
regions_path = Path(args.regions).resolve()
check_readability('regions', regions_path)
check_content_size('regions', regions_path)
regions = regions_path
except:
log.error('provided regions file not valid! path=%s', regions)
sys.exit(f'ERROR: regions file ({regions}) not valid!')
log.info('regions=%s', regions)


# workflow configurations
Expand Down
4 changes: 3 additions & 1 deletion bakta/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,16 @@


############################################################################
# Protein identification settings
# CDS settings
############################################################################
MIN_PSCC_IDENTITY = 0.5 # min protein identity for PSC detection
MIN_PSC_COVERAGE = 0.8 # min protein coverage for PSC detection
MIN_PSC_IDENTITY = 0.9 # min protein identity for PSC detection
MIN_SORF_COVERAGE = 0.9 # min sORF coverage for PSC detection
MIN_SORF_IDENTITY = 0.9 # min sORF identity for PSC detection
HYPOTHETICAL_PROTEIN = 'hypothetical protein' # hypothetical protein product description
CDS_MAX_OVERLAPS = 30 # max overlap [bp] allowed for user-provided/de novo-predicted CDS overlaps
CDS_SOURCE_USER = 'user-provided'


############################################################################
Expand Down
41 changes: 38 additions & 3 deletions bakta/features/annotation.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,8 +169,12 @@ def detect_feature_overlaps(genome: dict):
crispr_arrays = contig_crispr_arrays[crispr_array['contig']]
crispr_arrays.append(crispr_array)
contig_cdss = {k['id']: [] for k in genome['contigs']}
contig_cdss_user_provided = {k['id']: [] for k in genome['contigs']}
for cds in genome['features'].get(bc.FEATURE_CDS, []):
cdss = contig_cdss[cds['contig']]
if(cds.get('source', None) == bc.CDS_SOURCE_USER):
cdss = contig_cdss_user_provided[cds['contig']]
else:
cdss = contig_cdss[cds['contig']]
cdss.append(cds)
contig_sorfs = {k['id']: [] for k in genome['contigs']}
for sorf in genome['features'].get(bc.FEATURE_SORF, []):
Expand Down Expand Up @@ -217,7 +221,7 @@ def detect_feature_overlaps(genome: dict):
ncRNA_region['product'], ncRNA_region['start'], ncRNA_region['stop'], ncRNA_region_overlap['product'], ncRNA_region_overlap['start'], ncRNA_region_overlap['stop'], overlap, ncRNA_region['contig'], ncRNA_region['score'], ncRNA_region_overlap['score']
)

# mark CDS overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs
# mark de novo-predicted CDS overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs and user-provided CDS
for cds in contig_cdss[contig['id']]:
# tmRNA overlaps
for tmRNA in contig_tm_rnas[contig['id']]:
Expand Down Expand Up @@ -279,6 +283,23 @@ def detect_feature_overlaps(genome: dict):
"overlap: CDS (%s/%s) [%i, %i] overlapping CRISPR [%i, %i], %s, contig=%s",
cds.get('gene', '-'), cds.get('product', '-'), cds['start'], cds['stop'], crispr['start'], crispr['stop'], overlap, cds['contig']
)
# user-provided CDS overlaps
for cds_user_provided in contig_cdss_user_provided[contig['id']]:
if(cds['stop'] < cds_user_provided['start'] or cds['start'] > cds_user_provided['stop']):
continue
else: # overlap -> remove cds
overlap = min(cds['stop'], cds_user_provided['stop']) - max(cds['start'], cds_user_provided['start']) + 1
if(overlap > bc.CDS_MAX_OVERLAPS):
overlap = f"[{max(cds['start'], cds_user_provided['start'])},{min(cds['stop'], cds_user_provided['stop'])}]"
cds['discarded'] = {
'type': bc.DISCARD_TYPE_OVERLAP,
'feature_type': bc.FEATURE_CDS,
'description': f'overlaps {bc.FEATURE_CDS} at {overlap}'
}
log.info(
"overlap: de novo CDS (%s/%s) [%i, %i] overlapping user-provided CDS [%i, %i], %s, contig=%s",
cds.get('gene', '-'), cds.get('product', '-'), cds['start'], cds['stop'], cds_user_provided['start'], cds_user_provided['stop'], overlap, cds['contig']
)

# remove sORF overlapping with tRNAs, tmRNAs, rRNAs, CRISPRs, inframe CDSs, shorter inframe sORFs
for sorf in contig_sorfs[contig['id']]:
Expand Down Expand Up @@ -342,7 +363,21 @@ def detect_feature_overlaps(genome: dict):
"overlap: sORF (%s/%s) [%i, %i] overlapping CRISPR [%i, %i], %s, contig=%s",
sorf.get('gene', '-'), sorf.get('product', '-'), sorf['start'], sorf['stop'], crispr['start'], crispr['stop'], overlap, sorf['contig']
)
# CDS overlaps (skipped as most overlaps are filtered in sORF detection)
# user-provided CDS overlaps
for cds_user_provided in contig_cdss_user_provided[contig['id']]:
if(sorf['stop'] < cds_user_provided['start'] or sorf['start'] > cds_user_provided['stop']):
continue
else: # overlap -> remove sorf
overlap = f"[{max(sorf['start'], cds_user_provided['start'])},{min(sorf['stop'], cds_user_provided['stop'])}]"
sorf['discarded'] = {
'type': bc.DISCARD_TYPE_OVERLAP,
'feature_type': bc.FEATURE_CDS,
'description': f'overlaps {bc.FEATURE_CDS} at {overlap}'
}
log.info(
"overlap: sORF (%s/%s) [%i, %i] overlapping user-provided CDS [%i, %i], %s, contig=%s",
sorf.get('gene', '-'), sorf.get('product', '-'), sorf['start'], sorf['stop'], cds_user_provided['start'], cds_user_provided['stop'], overlap, sorf['contig']
)

# sORF overlaps
for overlap_sorf in contig_sorfs[contig['id']]:
Expand Down
Loading

0 comments on commit 317107c

Please sign in to comment.