Information for Machine Annotation Service (MAS) Developers
Thank you for your interest in offering a Machine Annotation Service (MAS) for the DiSSCo community! These services can operate in the DiSSCo data infrastructure to make annotations in an automated way on one or multiple digital objects. DiSSCo is an evolving research infrastructure designed to support digitized natural science collections. It provides Digital Specimens and Digital Media objects from diverse collections in a single harmonized data model, openDS. The Digital Specimen data infrastructure offers a collaborative space for the community to annotate these Digital Specimens and Media Objects for purposes like digitisation, quality enhancement and data enrichment.
While individuals can manually annotate resources using DiSSCo's DiSSCover platform, as demonstrated in our sandbox environment, there’s also a significant opportunity for machine actors to annotate specimens on a large scale. Machine Annotation Services (MASs) are computational tools that automatically produce annotations on digital objects when certain conditions are met. These services can be triggered either automatically, when a digital object is first ingested into the system, or manually through a user in DiSSCover. MASs often use external APIs or AI to produce their annotations. A MAS can use the results of another MAS and a MAS can be used in combination with user validation afterwards.
The DiSSCo development team has developed some example MASs, which you can explore here. We look forward to your contributions to enhancing the DiSSCo community!
A template for MAS development is provided here.
Within DiSSCo, two kinds of digital objects can be annotated: Digital Specimens and Digital Media, both of which adhere to the openDS specification. To learn more about openDS, you can visit the Terms Site and the OpenDS GitHub. A solid understanding of the openDS specification will be useful when developing MASs.
The annotations also follow a specific schema. You can find the expected annotation format here. The schema's components are described later in this document.
In order for DiSSCo to utilize a MAS, it must first be containerized and available in a public container. This ensures the MAS can be deployed and scaled efficiently within the DiSSCo Infrastructure.
MAS providers register MASs within the DiSSCo Orchestration Service which manages the deployment of such resources. To register a MAS, you must log in to the orchestration service using a Google Account, ORCID, or institutional login. Select the "Machine Annotation Services" tab, and click on the "Add MAS" button to the right. This will lead you to a form to register your MAS.
Upon registration, MAS providers have the option to set filters that determine which objects their MAS will annotate. Filters are an invaluable tool for ensuring that only relevant digital objects are processed by the MAS. For example, if a MAS is specifically designed to annotate media objects, applying a filter such as ods:type = "https://doi.org/21.T11148/bbad8c4e101e8af01115" will exclude any non-relevant objects, ensuring that the MAS only operates on the appropriate data.
When a filter is included on a MAS, the MAS will only be available for resources that meet the specified criteria. Therefore, it is strongly recommended to include filters to optimize the accuracy and efficiency of the annotation process. This targeted approach helps maintain the quality and relevance of annotations generated within the DiSSCo ecosystem.
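The effect of a filter can be pictured with a small sketch. The filter representation used here (a mapping from a field name to its allowed values) and the function name matches_filter are assumptions made for illustration only; the actual filter syntax is the one accepted by the registration form in the orchestration service.

```python
# The media type identifier quoted in the filter example above.
MEDIA_TYPE = "https://doi.org/21.T11148/bbad8c4e101e8af01115"


def matches_filter(digital_object: dict, filters: dict) -> bool:
    """Return True when every filtered field holds one of the allowed values.

    `filters` is a hypothetical structure: {field name: [allowed values]}.
    """
    return all(
        digital_object.get(field) in allowed
        for field, allowed in filters.items()
    )


# A filter that only lets media objects through.
media_only = {"ods:type": [MEDIA_TYPE]}
```

With this filter, an object whose ods:type differs from the media type would never be offered to the MAS.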
In the MAS data model there is space for both environmental variables and secrets. The environmental variables can be used to set parameters for an algorithm or to feature toggle specific parts of the MAS. The environmental variables should not be used to store secrets: they are not encrypted and can be read by anybody. For secret variables, some additional actions are needed to ensure the secret is well encrypted and can be injected into the application. For this step, the DiSSCo development team is required to add the secret to the DiSSCo Secret Store. This is a manual action which has not been automated (yet). This means that the secret should first be securely provided to the DiSSCo development team. For the transfer of the secret, several options are available, all of which should include two-factor authentication. For example, a zip file can be secured with a password and sent via email. The password should then be sent through a different medium, such as a text message or a chat message. It is also possible to use a website such as https://onetimesecret.com/, preferably with a passphrase which is provided through a different medium.
DiSSCo will then insert the secret into its Secret Store and will supply the MAS provider with the ods:secretKeyRef. The MAS provider can use this secretKeyRef in the data model to inject the secret into an environmental variable, for which the schema:name can be used as the variable name.
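Inside the MAS, both kinds of values arrive as ordinary environment variables. The sketch below shows one way to read them in Python. The variable names MAS_THRESHOLD and EXTERNAL_API_KEY and the default value are hypothetical; in practice the names are the schema:name values chosen at registration.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Collect MAS settings from environment variables.

    MAS_THRESHOLD (a plain parameter) and EXTERNAL_API_KEY (a secret) are
    hypothetical names used only for this example.
    """
    api_key = env.get("EXTERNAL_API_KEY")
    if not api_key:
        # Fail early instead of calling an external API without credentials.
        raise RuntimeError("EXTERNAL_API_KEY is not set")
    return {
        "threshold": float(env.get("MAS_THRESHOLD", "0.8")),
        "api_key": api_key,
    }
```

Passing the mapping as an argument keeps the function easy to test; the secret itself should never be logged or copied into an annotation.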
MAS providers can opt to enable their service for batch annotations, a feature particularly useful for services that can be applied to multiple specimens simultaneously. Batch annotations help reduce computational overhead by allowing DiSSCo to identify resources with identical input data. Instead of running the MAS separately for each resource, DiSSCo applies the annotation across all relevant resources, streamlining the process and conserving resources.
However, it's important to note that a MAS must be specifically designed with batching in mind, as it requires additional metadata to support this functionality. For detailed guidance on developing a batch-enabled MAS, please refer to the "Batching Annotations" section later in this guide.
DiSSCo does not yet have a production environment, but MASs can already be implemented and tried out in DiSSCo's Sandbox environment. In the Sandbox, all data, including annotations, will be deleted from time to time. When a production environment becomes available, likely at the end of 2024, developers can offer their MAS there once their institution is formally recognised as a DiSSCo service provider. Any institution can become a service provider. To do so, the institution needs an approved service delivery plan or other formal agreement with DiSSCo ERIC, once the ERIC is established (planned for 2026).
DiSSCo communicates with Machine Annotation Services (MASs) through a Kafka queue. Kafka is an asynchronous messaging system. It allows events, such as tasks or data updates, to be sent between systems in a highly reliable and scalable way. When a user schedules a MAS via the DiSSCover platform, DiSSCo dispatches a Kafka message to the designated MAS, initiating the annotation process.
Each scheduled MAS task is assigned a unique jobId, which is included in the Kafka message. This jobId must be returned to DiSSCo exactly as it was received to ensure proper tracking and processing. The Kafka message is sent as a JSON object structured as follows:
{
  "jobId": "20.5000.1025/ABC-123-XYZ",
  "digitalSpecimen OR digitalMedia": {
    // follows OpenDS specification
  },
  "batchingRequested": true
}
The boolean value "batchingRequested" indicates whether or not the scheduling user has requested this MAS to be performed as a batch, if applicable.
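A minimal sketch of how a MAS might unpack such a message in Python is given below. It assumes that the object is stored under either a "digitalSpecimen" or a "digitalMedia" key, as the structure above suggests, and that a missing "batchingRequested" can be treated as false. The function name and the keys of the returned dictionary are illustrative; receiving the raw bytes from Kafka is left to whichever client library the MAS uses.

```python
import json


def parse_mas_message(raw: bytes) -> dict:
    """Decode one Kafka message value and pull out the parts a MAS needs."""
    message = json.loads(raw)
    if "digitalSpecimen" in message:
        kind = "digitalSpecimen"
    elif "digitalMedia" in message:
        kind = "digitalMedia"
    else:
        raise ValueError("message has neither digitalSpecimen nor digitalMedia")
    return {
        # Keep the jobId untouched: it must be returned exactly as received.
        "job_id": message["jobId"],
        "kind": kind,
        "digital_object": message[kind],
        # Assumption: an absent flag means batching was not requested.
        "batching_requested": bool(message.get("batchingRequested", False)),
    }
```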
After processing, the MAS returns its annotations in the form of an AnnotationEvent, adhering to this schema.
The response includes the following key fields:
- "jobId": The unaltered jobId from the triggering Kafka message.
- "annotations": A list of one or more annotations, each following the annotation event data model for MAS.
- "batchMetadata" (optional): A field that describes the property used to generate the annotation, particularly relevant if batch processing was involved. Further details on this can be found in the next section of this guide.
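The shape of the response can be sketched as follows. Only the three top-level keys come from this guide; the function name is hypothetical, and the content of each annotation must follow the annotation schema, which is not reproduced here.

```python
import json


def build_annotation_event(job_id, annotations, batch_metadata=None) -> bytes:
    """Serialise an AnnotationEvent with the three fields described above."""
    if not annotations:
        raise ValueError("an AnnotationEvent needs at least one annotation")
    event = {
        "jobId": job_id,  # returned exactly as it was received
        "annotations": list(annotations),
    }
    if batch_metadata:
        # Optional: only included when the MAS supports batching.
        event["batchMetadata"] = list(batch_metadata)
    return json.dumps(event).encode("utf-8")
```

The resulting bytes are what the MAS would publish back to DiSSCo through its Kafka producer.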
When a MAS consistently produces the same output from a given input, it’s possible to identify all resources that meet those criteria and annotate them with the same result. For example, a MAS that uses the field $.ods:hasEvents.ods:hasLocation.dwc:locality to generate annotations for ods:hasEvents.ods:hasLocation.dwc:georeference—as demonstrated by the Mindat Georeferencing MAS—could apply this method.
To enable this process, MASs can include batchMetadata in their response. The information provided in the batchMetadata allows DiSSCo to generate search queries that identify resources matching the criteria originally used to produce the annotation. This allows the system to apply the same annotation across multiple resources without needing to re-run the MAS for each individual object.
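The matching idea can be illustrated with a small in-memory sketch: given a field path written in block notation with wildcards and an expected value, decide whether another object has the same input. This is not DiSSCo's implementation, which runs search queries against its own index; the function names and the example path are assumptions for illustration.

```python
import re

# Matches one path segment: either ['some:key'] or the wildcard [*].
_SEGMENT = re.compile(r"\['([^']+)'\]|\[\*\]")


def parse_block_path(path: str) -> list:
    """Split "['a'][*]['b']" into ['a', None, 'b'] (None marks a wildcard)."""
    tokens, pos = [], 0
    for match in _SEGMENT.finditer(path):
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos} in {path!r}")
        tokens.append(match.group(1))
        pos = match.end()
    if pos != len(path):
        raise ValueError(f"unexpected text at position {pos} in {path!r}")
    return tokens


def values_at(node, tokens) -> list:
    """Collect every value reachable from node by following the tokens."""
    if not tokens:
        return [node]
    head, rest = tokens[0], tokens[1:]
    if head is None:  # wildcard: descend into every element of an array
        if not isinstance(node, list):
            return []
        found = []
        for item in node:
            found.extend(values_at(item, rest))
        return found
    if isinstance(node, dict) and head in node:
        return values_at(node[head], rest)
    return []


def has_same_input(digital_object, input_field, input_value) -> bool:
    """True when the object holds input_value somewhere under input_field."""
    return input_value in values_at(digital_object, parse_block_path(input_field))
```

Every object for which has_same_input returns True would receive the same annotation without the MAS being run again.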
The schema for batchMetadata can be found here. For each item in the batchMetadata, there must be an annotation with the corresponding "placeInBatch" value. The batchMetadata includes the following key fields:
- "placeInBatch": Integer that indicates which annotation this batch metadata corresponds to. There MUST be a corresponding "placeInBatch" value in one annotation in the event. If more than one annotation has the same placeInBatch value, only the first annotation will be used to create a base annotation.
- "inputField": The full JSONPath of the field used to generate the MAS annotation, in JSONPath block notation, e.g. ['ods:DigitalSpecimen']['ods:hasIdentifications'][*]['ods:hasTaxonIdentifications'][*]['dwc:taxonRank']. Array indexes must be omitted; use wildcards instead.
- "inputValue": The value stored at the specified JSONPath.
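The following sketch shows one way to build a batchMetadata item and to check the placeInBatch rule before sending an event. It assumes the three fields sit side by side in each item and that placeInBatch is a top-level key of each annotation; the exact nesting is defined by the schemas, and the helper names are hypothetical.

```python
def to_input_field(path_parts) -> str:
    """Render a path in block notation, replacing array indexes with [*]."""
    rendered = []
    for part in path_parts:
        if isinstance(part, int):
            rendered.append("[*]")  # indexes must be omitted in favour of wildcards
        else:
            rendered.append(f"['{part}']")
    return "".join(rendered)


def batch_metadata_item(place_in_batch: int, path_parts, input_value) -> dict:
    """Build one batchMetadata item from the path the MAS actually read."""
    return {
        "placeInBatch": place_in_batch,
        "inputField": to_input_field(path_parts),
        "inputValue": input_value,
    }


def check_places(annotations, batch_metadata) -> None:
    """Raise if a batchMetadata item has no annotation with its placeInBatch."""
    places = {annotation.get("placeInBatch") for annotation in annotations}
    missing = [
        item["placeInBatch"]
        for item in batch_metadata
        if item["placeInBatch"] not in places
    ]
    if missing:
        raise ValueError(f"no annotation for placeInBatch values {missing}")
```

Building the path from the parts the MAS traversed, including the concrete indexes, keeps the wildcard substitution in one place.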
Batching can only be done if the MAS sends annotations for one type of object in one event: either Digital Specimens or Media Objects.
Thank you for your commitment to enhancing the DiSSCo community through the development and deployment of MASs. By following the guidelines outlined in this guide, you are playing a crucial role in advancing biodiversity research. Whether through novel machine learning approaches, georeferencing tools, or automated data checks, your contributions help improve natural science collection data quality across Europe. Machine Annotation Services not only support ongoing research, but also empower the scientific community to engage in meaningful post-publication curation. This collaborative approach enhances the value of digitized collections, ensuring they remain relevant and up-to-date long after their initial publication.