Core Services and End User Services

Core Services

Persistent Identifier Infrastructure (PID Infrastructure)

DiSSCo uses the Handle System, a globally distributed system for resolving persistent identifiers (PIDs) known as Handles. A Handle consists of two parts: a prefix and a suffix. When the Global Handle System receives a resolution request for a PID, it looks at the prefix and redirects to the Local Handle Server responsible for managing that prefix (DiSSCo operates a Local Handle Server under the prefix 20.5000.1025/). The Local Handle Server then uses the suffix of the Handle to identify the appropriate PID record and redirects the user to the location stored in that record.
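
As a concrete illustration, a Handle under the DiSSCo prefix can be resolved programmatically through the Handle.net proxy's REST API. A minimal sketch in Python, assuming a hypothetical suffix (only the 20.5000.1025 prefix comes from this page):

```python
# Minimal sketch: resolve a DiSSCo Handle via the Handle.net proxy REST API.
# The suffix below is a made-up example for illustration.
import requests

PREFIX = "20.5000.1025"
suffix = "ABC-123-XYZ"  # hypothetical suffix

# The Global Handle System routes the request to the Local Handle Server
# responsible for the 20.5000.1025 prefix.
resp = requests.get(f"https://hdl.handle.net/api/handles/{PREFIX}/{suffix}")
resp.raise_for_status()
record = resp.json()

# Each entry in "values" is one typed attribute of the PID record;
# the URL-typed entry holds the location the resolver redirects to.
for value in record.get("values", []):
    if value["type"] == "URL":
        print("Resolves to:", value["data"]["value"])
```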

DiSSCo is implementing FAIR Digital Object (FDO) Types for its digital objects. The Type of a digital object determines what additional, machine-actionable information is stored in its PID record. The data model for these Types is under ongoing development.

DiSSCo's Local Handle Server is deployed on an EC2 instance on AWS and is linked to a MongoDB document store. Handles can be created, updated, or tombstoned, and FDO records fetched, using the DiSSCo Handle Manager API.
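
For example, fetching an FDO record might look like the following sketch; the base URL and path are assumptions for illustration, not the documented Handle Manager API surface:

```python
# Hedged sketch: fetching an FDO record from the Handle Manager API.
# The base URL and path are hypothetical, not the documented endpoint.
import requests

BASE_URL = "https://example.dissco.eu/handle-manager"  # hypothetical host
handle = "20.5000.1025/ABC-123-XYZ"                    # hypothetical handle

resp = requests.get(f"{BASE_URL}/api/handles/{handle}")  # assumed path
resp.raise_for_status()
fdo_record = resp.json()
print(fdo_record)
```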

Translator Services

The translation layer is the first step in the data ingestion process and the first point of entry into the DiSSCo infrastructure. This service retrieves data from various sources in a variety of data standards (Darwin Core, ABCD, local formats, etc.), data exchange formats (RDF, XML, JSON, YAML, etc.), and architectures (REST, GraphQL, gRPC, etc.). This means there will be a multitude of translators, not just one; the hope is that the differences will be limited enough that generic translators can be used. In addition to translating the data, the translator services should also add messages to the relevant data enrichment queues so that these processes can be triggered for the processed openDS objects.

The main functions of the translator services are to connect to the DiSSCo Facility and retrieve or receive the data, translate the data into valid openDS, and publish the resulting Digital Specimen to a queue.
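
A hedged sketch of what such a generic translator could look like, assuming a JSON REST source, a Kafka queue, and illustrative openDS field names (none of which are prescribed by this page):

```python
# Illustrative translator skeleton: retrieve, translate, publish.
# The source URL, topic name, and field names are all hypothetical.
import json
import requests
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def translate(source_record: dict) -> dict:
    """Map one source record to a (simplified) openDS digital specimen."""
    return {
        "ods:specimenName": source_record.get("scientificName"),
        "ods:physicalSpecimenID": source_record.get("catalogNumber"),
    }

# 1. Connect to the facility endpoint and retrieve the data.
records = requests.get("https://facility.example.org/specimens").json()

# 2. Translate each record and publish the Digital Specimen to a queue.
for record in records:
    producer.send("digital-specimen", translate(record))  # hypothetical topic
producer.flush()
```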

Processing Services

Once incoming data is translated into openDS, it enters the processing layer. The processing services validate the openDS-compliant data and insert it into the data storage layer. This layer is the gatekeeper of the data storage layer: all modification actions on the data storage layer must go through the processing service. This means it must also check whether an object is new, updated, or unchanged and act accordingly.
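
The new/updated/unchanged check could, for instance, be implemented by comparing content hashes, as in this sketch (the hashing approach is an assumption for illustration):

```python
# Sketch of the new / updated / unchanged decision, assuming the processor
# compares an incoming object against the stored version by content hash.
import hashlib
import json

def content_hash(obj: dict) -> str:
    # Serialize with sorted keys so the hash is stable across key order.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def classify(incoming: dict, stored: dict | None) -> str:
    if stored is None:
        return "new"        # no record with this identifier yet: insert
    if content_hash(incoming) == content_hash(stored):
        return "unchanged"  # identical content: skip
    return "updated"        # same identifier, different content: version it
```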

The processing layer consists of three interconnected processing services: the specimen processor, the media processor, and the annotation processor. When specimens are ingested, they pass through the specimen processor, and their associated media objects are then handled by the media processor. This process results in "auto-accepted annotations", which are handled by the annotation processor. The annotation processor also handles all incoming annotations outside of ingestion, whether generated by a human or by a Machine Annotation Service.

Machine Annotation Services

DiSSCo leverages independent Machine Annotation Services (MASs) to enrich openDS data. A MAS accepts specimen or media data in openDS format and generates one or more annotations, also adhering to the openDS specification, which are linked to their respective targets via PID. This approach decouples MASs completely from the rest of the architecture, adding flexibility and enabling partner organisations to develop custom MASs.

MASs are triggered asynchronously by queued messages containing the openDS data to be enriched; this demand-driven deployment strategy optimizes resource utilization. Once a MAS has processed a message and generated annotations, the results are pushed back to a queue for further processing by the annotation processing service. This asynchronous architecture ensures efficient resource allocation and scalability.
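
A minimal sketch of such a MAS worker loop, assuming Kafka as the queueing system; the topic names and annotation shape are illustrative, not the documented openDS schema:

```python
# Hedged MAS worker sketch: consume openDS data, produce annotations.
# Topic names, field names, and the annotation shape are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "mas-input",                          # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enrich(digital_object: dict) -> dict:
    """Produce one annotation linked to its target via PID."""
    return {
        "oa:hasTarget": digital_object["ods:ID"],  # link by PID
        "oa:hasBody": {"value": "example enrichment result"},
    }

for message in consumer:
    annotation = enrich(message.value)
    # Push the result back for the annotation processing service.
    producer.send("annotations", annotation)  # hypothetical output topic
```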

Orchestration Service

The orchestration service is responsible for overseeing the data ingestion pipelines. It ensures that other services are triggered and that their progress is monitored. To enable persistent storage of this state, a storage solution will be used.

The orchestration frontend provides administrators with the ability to schedule translator and/or enrichment services. The orchestration backend is responsible for triggering the requested services. The orchestration service manages the following resources:

  • Data Mapping: A data mapping allows a system administrator to control how their data is translated into the openDS specification. It allows clients to set default mappings (e.g. setting all instances of ods:OrganisationID to the institution's ROR) and field mappings (defining which fields in the source data populate which openDS terms); see the sketch after this list.
  • Source System: A source system is an endpoint from which specimen and media data can be harvested. Once a source system is registered, it is automatically re-ingested every 7 days. A user can also manually schedule an ingestion from the Orchestration Service.
  • Machine Annotation Services: MASs (described above) are registered with the orchestration service so that they can be scheduled as enrichment services.
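
The data mapping mentioned above could conceptually look like the following sketch; the dict-based format, the dwc: source field names, and the helper function are assumptions, not the orchestration service's actual configuration format:

```python
# Illustrative data mapping: defaults plus field mappings.
data_mapping = {
    # Default mappings: constant values applied to every record, e.g. pin
    # all instances of ods:OrganisationID to the institution's ROR.
    "defaults": {
        "ods:OrganisationID": "https://ror.org/0566bfb96",  # example ROR
    },
    # Field mappings: which source fields populate which openDS terms
    # (the dwc: source field names are illustrative).
    "fieldMapping": {
        "ods:specimenName": "dwc:scientificName",
        "ods:physicalSpecimenID": "dwc:catalogNumber",
    },
}

def apply_mapping(source: dict, mapping: dict) -> dict:
    """Apply defaults first, then copy mapped source fields."""
    result = dict(mapping["defaults"])
    for target, source_field in mapping["fieldMapping"].items():
        if source_field in source:
            result[target] = source[source_field]
    return result
```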

Digital Specimen Repositories

There are three layers of storage within DiSSCo. The most recent version of a Digital Specimen (i.e. active data) is stored in a PostgreSQL database. All information pertaining to the Digital Specimen, both harmonized and unharmonized, is stored in this database.

Additionally, the harmonized data from the active Digital Specimen is stored in an Elasticsearch repository, allowing for rapid search over harmonized data. Unharmonized data is not included in Elasticsearch; keeping the number of indexed terms down preserves search performance.

Finally, older versions of the Digital Specimens, as well as change logs, are stored in a MongoDB Document Store. Whenever a Digital Specimen is updated, its previous version is persisted in the Document Store.
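
Putting the three layers together, an update could conceptually follow the write path sketched below; the client objects and table/index/collection names are assumptions for illustration:

```python
# Conceptual sketch of the three-layer write path on an update.
import json
import psycopg2
from elasticsearch import Elasticsearch
from pymongo import MongoClient

pg = psycopg2.connect("dbname=dissco")           # active data (assumed DSN)
es = Elasticsearch("http://localhost:9200")      # harmonized search index
versions = MongoClient()["dissco"]["versions"]   # version history

def update_specimen(specimen_id: str, previous: dict, new: dict) -> None:
    # 1. Persist the previous version (change log) in the Document Store.
    versions.insert_one({"id": specimen_id, "version": previous})

    # 2. Overwrite the active record (harmonized + unharmonized) in PostgreSQL.
    with pg.cursor() as cur:
        cur.execute("UPDATE specimen SET data = %s WHERE id = %s",
                    (json.dumps(new), specimen_id))
    pg.commit()

    # 3. Re-index only the harmonized part for fast search.
    harmonized = {k: v for k, v in new.items() if k.startswith("ods:")}
    es.index(index="digital-specimen", id=specimen_id, document=harmonized)
```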

Authentication and Authorization Infrastructure (AAI)

Authentication and authorization play a key role for all services in the infrastructure. Authentication refers to the process of signing in as a registered user, whereas authorization refers to the rights and permissions a user has to perform certain actions. From the user's perspective, it is desirable that a user can log in to all DiSSCo services with the same credentials (single sign-on, SSO). Another important feature is identity and access management (IAM), which involves managing a user's details and permissions within the system. DiSSCo uses Keycloak for identity management.
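
As an example, a client can obtain a token through Keycloak's standard OpenID Connect token endpoint; the host, realm, client ID, and credentials below are placeholders:

```python
# Sketch of SSO login via Keycloak's OpenID Connect token endpoint,
# assuming the direct-access (password) grant is enabled for the client.
import requests

KEYCLOAK = "https://login.example.dissco.eu"   # hypothetical host
REALM = "dissco"                               # hypothetical realm

resp = requests.post(
    f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token",
    data={
        "grant_type": "password",
        "client_id": "dissco-client",          # hypothetical client ID
        "username": "alice",
        "password": "secret",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]

# The bearer token then authorizes requests to any DiSSCo service (SSO).
headers = {"Authorization": f"Bearer {token}"}
```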

Indexing and API

APIs are essential for data accessibility and usability. The DiSSCo core infrastructure exposes its data via an API. This API is used both internally, for DiSSCo's own data visualisation, and externally. This enables external users to access the data programmatically and to build their own data visualisation tools on top of the DiSSCo infrastructure. Endpoints for searching (powered by Elasticsearch) are publicly available. Authenticated users can also use the API to manage their annotations.

The API follows best practices from the JSON:API specification and is documented using the OpenAPI standard and a Swagger endpoint.
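
A hedged example of querying a search endpoint and reading the JSON:API response; the base URL, endpoint path, and query parameter are assumptions:

```python
# Sketch of a public search query against a JSON:API-style endpoint.
import requests

BASE = "https://api.example.dissco.eu"  # hypothetical base URL

resp = requests.get(f"{BASE}/digital-specimen/search",   # assumed path
                    params={"q": "Fagus sylvatica"})     # assumed parameter
resp.raise_for_status()
body = resp.json()

# JSON:API returns resources under "data", each with type, id, and attributes.
for resource in body["data"]:
    print(resource["id"], resource["attributes"].get("ods:specimenName"))
```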

End-User Services

DiSSCover (Formerly UCAS - Unified Curation and Annotation Service)

DiSSCover is DiSSCo's platform for increasing the value of specimen data through annotation by machines and by subject-matter experts. DiSSCover-E (Explore) is the main user interface for DiSSCo: a web-based platform for community members to explore and annotate digital specimens. DiSSCover-M enables machine annotations for automated data checking.

European Loans and Visits System (ELViS)

This service provides a unified way to request visits, loans, and virtual access. Virtual access requests made through ELViS include support for collaborating on virtual access (VA) ideas and for proposal submission, providing on-demand digitization as a new type of access.
