id: https://w3id.org/nmdc/annotation
name: NMDC-Annotation
title: Annotation Module for NMDC Schema
description: >-
This schema module represents annotations, including functional annotations of proteins and other gene products,
as well as controlled terms for describing entities such as metabolites.
license: https://creativecommons.org/publicdomain/zero/1.0/
see_also:
- https://github.com/microbiomedata/nmdc-metadata/issues/176
notes:
- Removed slot_uri of rdf:type from type slot. Is OntologyClass an appropriate range?
imports:
- core
prefixes:
COG: "https://bioregistry.io/cog:"
EC: "https://bioregistry.io/eccode:"
GO: http://purl.obolibrary.org/obo/GO_
KEGG.ORTHOLOGY: "https://bioregistry.io/kegg.orthology:"
KEGG.REACTION: "https://bioregistry.io/kegg.reaction:"
KEGG_PATHWAY: "https://bioregistry.io/kegg.pathway:"
MetaCyc: "https://bioregistry.io/metacyc.compound:"
RHEA: "https://bioregistry.io/rhea:"
SEED: "https://bioregistry.io/seed:"
biolink: https://w3id.org/biolink/vocab/
linkml: https://w3id.org/linkml/
nmdc: https://w3id.org/nmdc/
default_prefix: nmdc
default_range: string
classes:
GenomeFeature:
class_uri: nmdc:GenomeFeature
description: >-
A feature localized to an interval along a genome
comments:
- corresponds to an entry in GFF3
see_also:
- https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
slots:
- encodes
- end
- feature_type
- phase
- seqid
- start
- strand
- type
- feature_category
slot_usage:
seqid:
required: true
start:
required: true
end:
required: true
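# Illustrative sketch only, not taken from NMDC data: an instance that supplies
# the three slots marked required above (seqid, start, end) might look like the
# commented fragment below. All values are hypothetical placeholders, and other
# slots (for example type) may be required elsewhere in the schema.
#
#   seqid: example_contig_1   # hypothetical landmark ID
#   start: 100
#   end: 250
#   strand: "+"
#   phase: 0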
FunctionalAnnotationTerm:
class_uri: nmdc:FunctionalAnnotationTerm
aliases:
- function
- FunctionalAnnotation
is_a: OntologyClass
description: >-
Abstract grouping class for any term/descriptor that can be applied to a functional unit of a genome (protein, ncRNA, complex).
abstract: true
todos:
- decide if this should be used for product naming (Duncan, 2021-04-02)
- Retaining this even after removing Reaction. See todos on the Pathway and OrthologyGroup subclasses.
Pathway:
class_uri: nmdc:Pathway
aliases:
- biological process
- metabolic pathway
- signaling pathway
is_a: FunctionalAnnotationTerm
description: >-
A pathway is a sequence of steps/reactions carried out by an organism or community of organisms
id_prefixes:
- KEGG_PATHWAY
- COG
exact_mappings:
- biolink:Pathway
todos:
- If we reverted to including Reaction in the schema, then a Reaction would be a reasonable part_of a Pathway
- Is Pathway instantiated in a MongoDB collection? Aren't Pathways searchable in the Data Portal?
deprecated: "not used. 2024-07-10 https://github.com/microbiomedata/nmdc-schema/issues/1881"
OrthologyGroup:
class_uri: nmdc:OrthologyGroup
is_a: FunctionalAnnotationTerm
description: >-
A set of genes or gene products in which all members are orthologous
id_prefixes:
- CATH
- EGGNOG
- KEGG.ORTHOLOGY
- PANTHER.FAMILY
- PFAM
- SUPFAM
- TIGRFAM
exact_mappings:
- biolink:GeneFamily
notes:
- KEGG.ORTHOLOGY prefix is used for KO numbers
todos:
- Is OrthologyGroup instantiated in a MongoDB collection? Aren't Pathways searchable in the Data Portal?
FunctionalAnnotation:
class_uri: nmdc:FunctionalAnnotation
description: >-
An assignment of a function term (e.g. reaction or pathway) that is executed by a gene product,
or in which the gene product plays an active role.
Functional annotations can be assigned manually by curators or automatically by workflows.
In the context of NMDC, all functional annotation is performed
automatically, typically using HMM- or BLAST-type methods.
see_also:
- https://img.jgi.doe.gov/docs/functional-annotation.pdf
- https://github.com/microbiomedata/mg_annotation/blob/master/functional-annotation.wdl
slots:
- has_function
- subject
- was_generated_by
- type
- feature_category
slot_usage:
has_function:
notes:
- Still missing patterns for COG and RetroRules
- These patterns are not yet tied to the listed prefixes.
A discussion about that possibility had been started,
including the question of whether these lists are intended to be open examples or closed
was_generated_by:
description: provenance for the annotation.
notes: To be consistent with the rest of the NMDC schema we use the PROV annotation model, rather than GPAD
range: MetagenomeAnnotation
structured_pattern:
syntax: "{id_nmdc_prefix}:(wfmgan)-{id_shoulder}-{id_blade}{id_version}$"
interpolated: true
narrow_mappings:
- biolink:GeneToGoTermAssociation
slots:
feature_category:
range: ControlledIdentifiedTermValue
description: A Sequence Ontology term that describes the category of a feature
subject:
range: GeneProduct
has_function:
range: string
notes:
- "the range for has_function was asserted as functional_annotation_term/FunctionalAnnotationTerm,"
- "but is actually taking string arguments in MongoDB,"
- "and those are frequently fulltext, not CURIEs. MAM 2021-06-23"
pattern: "^(KEGG_PATHWAY:\\w{2,4}\\d{5}|KEGG\\.REACTION:R\\d+|RHEA:\\d{5}|MetaCyc:[A-Za-z0-9+_.%-:]+|EC:\\d{1,2}(\\.\\d{0,3}){0,3}|GO:\\d{7}|MetaNetX:(MNXR\\d+|EMPTY)|SEED:\\w+|KEGG\\.ORTHOLOGY:K\\d+|EGGNOG:\\w+|PFAM:PF\\d{5}|TIGRFAM:TIGR\\d+|SUPFAM:\\w+|CATH:[1-6]\\.[0-9]+\\.[0-9]+\\.[0-9]+|PANTHER\\.FAMILY:PTHR\\d{5}(\\:SF\\d{1,3})?)$"
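# Illustrative examples only: the values below are assumptions chosen to fit the
# has_function pattern above and are not drawn from NMDC data. Strings of these
# shapes would match:
#   GO:0008150
#   EC:1.1.1.1
#   KEGG.ORTHOLOGY:K00001
#   PFAM:PF00069
#   RHEA:10000
# Free text such as "hypothetical protein" would not match, because every
# alternative in the pattern begins with a prefix followed by a colon.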
gff_coordinate:
range: integer
minimum_value: 1
description: A positive 1-based integer coordinate indicating start or end
comments:
- "For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature."
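# Worked example for the comment above (hypothetical numbers): on a circular
# landmark of length 1000, a feature that begins at position 990 and wraps past
# the origin to position 10 would be recorded as
#   start: 990
#   end: 1010   # 10 + 1000
# so that start <= end still holds.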
seqid:
description: The ID of the landmark used to establish the coordinate system for the current feature.
range: string
todos:
- "change range from string to object"
strand:
todos:
- "set the range to an enum?"
description: >-
The strand on which a feature is located. Has a value of '+' (sense strand or forward strand) or
'-' (anti-sense strand or reverse strand).
exact_mappings:
- biolink:strand
encodes:
range: GeneProduct
description: >-
The gene product encoded by this feature.
Typically this is used for a CDS feature or gene feature which will encode a protein.
It can also be used by a non-coding transcript or gene feature that encodes an ncRNA.
todos:
- If we revert Reaction back into the schema, that would be a reasonable domain for this slot
end:
range: integer
is_a: gff_coordinate
description: The end of the feature in positive 1-based integer coordinates
comments:
- "constraint: end >= start"
- "For features that cross the origin of a circular feature, end = the position of the end + the length of the landmark feature."
close_mappings:
- biolink:end_interbase_coordinate
feature_type:
range: string
description: "TODO: Yuri to write"
phase:
range: integer
minimum_value: 0
maximum_value: 2
description: >-
The phase for a coding sequence entity. For example, the phase of a CDS as represented in a GFF3 file, with a value of 0, 1, or 2.
exact_mappings:
- biolink:phase
start:
is_a: gff_coordinate
description: The start of the feature in positive 1-based integer coordinates
close_mappings:
- biolink:start_interbase_coordinate