Data Storage Solutions
DiSSCo uses several solutions to store and retrieve data. Each has been picked for its particular strengths and abilities. We try to use industry-standard, open-source technologies that follow general principles. This should enable us to swap out a particular tool when a better solution becomes available. In general, we distinguish four types of data within DiSSCo:
- Identifiers -> These are Persistent Identifiers (PIDs). DiSSCo stores identifiers for all non-DOI objects within its own infrastructure. These so-called Handles provide a unique identifier as well as metadata about the identifier.
- Active data -> Data that is actively used by the user, whether human or machine. The user can only act on the latest version of the data. These changes need to be fast, atomic, and consistent. When a user changes an object, the change should take effect immediately; timing is important.
- Indexed data -> To provide full search capabilities, we need to index the active data. Users will then be able to search on particular fields or through the whole object. As we need to limit the number of fields indexed, we will only index the harmonised fields.
- Event data -> This is historic data. The events provide information about the previous version, what changes were made to arrive at the current version, and by whom. Event data is immutable and offers only limited ways to search through it.
Each of these four types of data requires a different storage solution. In the next sections, we will go over the choices that have been made.
For the storage of identifiers, we are limited to the functionality provided by the Local Handle Server (LHS). The LHS provides information for PID resolution and ensures that requests through different resolution providers (hdl or doi) end up at DiSSCo. By default, this handle server stores its data on disk. However, due to limitations in the API (no batch functionality), we switched to a solution that uses a document store as storage. This means that all identifier records are stored in a MongoDB instance.
Each field in the FDO record results in a new document in the database. The ID (handle or DOI) is indexed, and for media and specimen records, we also index normalised versions of the media URL and the physical specimen ID, respectively. This allows the Handle API to retrieve records quickly without needing the Handle of the object, ensuring that no media object or specimen is issued a duplicate handle.
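The duplicate check described above can be illustrated with a small in-memory sketch. It is not the actual Handle API: the class, the field names, the `TEST/` prefix, and the normalisation rule (strip whitespace, lower-case) are all assumptions made for illustration.

```python
from typing import Optional


def normalise_specimen_id(raw: str) -> str:
    """Assumed normalisation rule: drop all whitespace and lower-case the ID."""
    return "".join(raw.split()).lower()


class HandleStore:
    """Hypothetical in-memory stand-in for the document store behind the LHS."""

    def __init__(self) -> None:
        self.documents = []        # one document per FDO field
        self._by_specimen_id = {}  # index: normalised specimen ID -> handle
        self._counter = 0

    def find_handle(self, physical_specimen_id: str) -> Optional[str]:
        """Look up a handle without knowing the handle itself."""
        return self._by_specimen_id.get(normalise_specimen_id(physical_specimen_id))

    def create_or_get(self, physical_specimen_id: str, fields: dict) -> str:
        """Return the existing handle, or mint a new one if none exists yet."""
        existing = self.find_handle(physical_specimen_id)
        if existing is not None:
            return existing  # no duplicate handle is issued
        self._counter += 1
        handle = f"TEST/{self._counter:04d}"  # placeholder prefix, not a real one
        for name, value in fields.items():
            # Each field of the record becomes its own document.
            self.documents.append({"handle": handle, "type": name, "data": value})
        self._by_specimen_id[normalise_specimen_id(physical_specimen_id)] = handle
        return handle
```

Requesting a handle twice for the same specimen, even with different spacing or casing in the physical specimen ID, returns the same handle and adds no new documents.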
For active data, we have chosen a relational database, implemented with PostgreSQL. A relational database fully supports atomicity, consistency, isolation, and durability (ACID), which means we can build on safe and persistent data. To track the relations between different objects, we can use the relational structure of the database. For unstructured data, we use a JSON column, which allows us to store unstructured data in a relational database while keeping the strengths of the normalised columns. While this provides a good solution for our active data, there are limitations. Several relational databases provide full-text search functionality, but not to the extent offered by a specialised tool. There are also limits to the scalability of a relational database, as it generally scales vertically instead of horizontally.
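As a rough sketch of this pattern, the example below combines normalised columns with a JSON payload and applies an update inside a transaction, but only if the caller acted on the latest version. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and it stores the JSON as text. The table, column, and function names are hypothetical and do not reflect the actual DiSSCo schema.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with normalised columns plus a JSON payload."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE digital_specimen (
            id TEXT PRIMARY KEY,
            version INTEGER NOT NULL,
            specimen_name TEXT,
            data TEXT NOT NULL  -- unstructured part, serialised as JSON
        )
        """
    )
    return conn


def insert_specimen(conn, specimen_id, name, data):
    """Insert version 1 of an object; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO digital_specimen VALUES (?, 1, ?, ?)",
            (specimen_id, name, json.dumps(data)),
        )


def update_name(conn, specimen_id, expected_version, new_name):
    """Apply the change only if the caller saw the latest version."""
    with conn:
        cur = conn.execute(
            "UPDATE digital_specimen "
            "SET specimen_name = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_name, specimen_id, expected_version),
        )
        return cur.rowcount == 1  # False means the version was stale
```

An update based on an outdated version matches no row and therefore changes nothing, which is one simple way to enforce that users only act on the latest version.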
For storage of indexed data (data we want to search on), we use a separate tool specialised in providing indexed and aggregated data. Elasticsearch is currently a de facto industry standard for this purpose and is widely used. It provides the ability to index large volumes of data and make them fully searchable. To prevent cluttering of indexes, we will only index harmonised fields. Because we know what values to expect in those fields, we can provide search and aggregation functionality on them.
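The idea of indexing only harmonised fields can be sketched without a running search cluster. The snippet below projects a full record onto an assumed set of harmonised fields and computes a simple aggregation over the result. The field names and helper functions are hypothetical and only illustrate the principle.

```python
from collections import Counter

# Assumption: an illustrative set of harmonised field names.
HARMONISED_FIELDS = {"specimenName", "country", "collector"}


def to_index_document(record: dict) -> dict:
    """Keep only the harmonised fields; everything else stays out of the index."""
    return {key: value for key, value in record.items() if key in HARMONISED_FIELDS}


def count_by(documents: list, field: str) -> Counter:
    """Aggregate documents by the value of one harmonised field."""
    return Counter(doc[field] for doc in documents if field in doc)
```

In Elasticsearch itself, a comparable effect is commonly obtained with an explicit mapping whose `dynamic` setting is `false`, which keeps unmapped fields in the stored source without indexing them. Whether DiSSCo configures its indexes this way is not stated here.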
Each change to an object triggers an event. This event describes the change in the form of a JSON Patch and also includes the new state of the object. This enables us both to recreate the object from scratch by replaying the changes and to retrieve individual states directly. Event data is immutable. Search capabilities on this data can be limited, as full search is only offered on the active data. Searching will primarily happen on PID, PID and version, and timestamp. The amount of event data could grow rapidly, as each change to an object triggers an event. This means we need a scalable solution that only has to provide limited search capabilities. We therefore believe a document store is best suited to this type of data. We will use MongoDB as the implementation, which is backed by a large community and provides all the functionality needed.
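The following minimal sketch shows how such events could support both replaying changes and direct state lookup. It implements only the `add`, `replace`, and `remove` operations of JSON Patch (RFC 6902), and only on JSON objects, not arrays. The event layout (`pid`, `version`, `timestamp`, `patch`, `state`) and the helper names are assumptions for illustration, not the actual DiSSCo event schema.

```python
import copy


def _tokens(path: str) -> list:
    """Split a JSON Pointer into unescaped reference tokens."""
    if not path.startswith("/"):
        raise ValueError(f"unsupported path: {path!r}")
    return [t.replace("~1", "/").replace("~0", "~") for t in path[1:].split("/")]


def apply_patch(state: dict, patch: list) -> dict:
    """Return a new state with the patch applied; the input is left untouched."""
    result = copy.deepcopy(state)
    for op in patch:
        *parents, leaf = _tokens(op["path"])
        target = result
        for key in parents:
            target = target[key]
        if op["op"] == "add":
            target[leaf] = copy.deepcopy(op["value"])
        elif op["op"] == "replace":
            if leaf not in target:
                raise KeyError(leaf)  # replace requires an existing member
            target[leaf] = copy.deepcopy(op["value"])
        elif op["op"] == "remove":
            del target[leaf]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return result


def make_event(pid, version, timestamp, patch, previous_state):
    """Build an event that carries both the patch and the resulting state."""
    return {
        "pid": pid,
        "version": version,
        "timestamp": timestamp,
        "patch": patch,
        "state": apply_patch(previous_state, patch),
    }


def replay(events, version=None):
    """Rebuild an object from scratch by applying patches in version order."""
    state = {}
    for event in sorted(events, key=lambda e: e["version"]):
        if version is not None and event["version"] > version:
            break
        state = apply_patch(state, event["patch"])
    return state
```

Because every event also stores the resulting state, a lookup on PID and version can return that state directly, while `replay` remains available to verify or reconstruct the history.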