CESSDA CDC Consumer Indexer (an OSMH consumer) for metadata harvesting and ingestion into Elasticsearch. See the OSMH System Architecture Document for more information about the Open Source Metadata Harvester (OSMH).
The overall Software Maturity Level for this product and the individual scores for each attribute can be found in the SML file.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
The following tools must be installed before compiling:
- Java JDK 17
./mvnw clean test
To perform SonarQube analysis locally, start a SonarQube server and then execute:
./mvnw sonar:sonar
./mvnw clean package
./mvnw spring-boot:run
To run the OSMH consumer with a custom profile, use the following command line:
java -jar target/pasc-oci*.jar --spring.profiles.active=${profile_name}
If no profile is specified, the default profile will be used. This profile is configured to use a local Elasticsearch instance hosted at http://localhost:9200.
- Makes use of TDD
- When running integration tests, a standalone Elasticsearch server is launched
The application loads configuration in the following order, as defined by the Spring Boot framework:
- Command line parameters
  - e.g. --logging.level.ROOT=DEBUG sets the logging level for all classes
- Environment variables, e.g. SECURITY_USER_NAME
  - Spring can use relaxed binding to convert environment variables into Java properties
  - e.g. SPRING_BOOT_ADMIN_USERNAME converts to spring.boot.admin.username
- application-[dev,local,prod].yml
  - dev, local and prod refer to Spring profiles
  - A Spring profile can be specified with the command line parameter --spring.profiles.active or the environment variable SPRING_PROFILES_ACTIVE
  - See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.profiles for more details
- application.yml
See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.external-config for detailed documentation.
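As a rough illustration of the precedence described above, a profile-specific file can override values from application.yml. The sketch below assumes a hypothetical application-dev.yml. The hostname is a placeholder and not part of this project, while elasticsearch.host and logging.level.ROOT are property names that appear elsewhere in this document.

```yaml
# application-dev.yml (hypothetical example, only read when the "dev" profile is active)
elasticsearch:
  host: es-dev.example.org   # placeholder hostname; overrides the value in application.yml
logging:
  level:
    ROOT: DEBUG              # same effect as passing --logging.level.ROOT=DEBUG
```

With such a file in place, starting the application with --spring.profiles.active=dev would pick up these values. Environment variables and command line parameters would still take precedence over them.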
If the application is registered with a Spring Boot Admin server, all environment properties can be changed at runtime.
Changes made at runtime take effect after a context reload, but are lost when the application restarts unless they are persisted in application.yml.
The OSMH indexer has the following settings that change the behaviour of the indexing process.
Property | Type | Description |
---|---|---|
baseDirectory | Path | Directory to look for pipeline.json repository definitions. |
languages | Languages | Configure which languages Elasticsearch indices will be created for. |
repos | List | Manually configured repository definitions. |
oaiPmh.concatSeparator | String | The string to use to concatenate repeated elements; concatenation is disabled if null. |
oaiPmh.metadataParsingDefaultLang.lang | String | The language to fall back to if @xml:lang is not present. Individual repositories can override this setting. |
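The properties in the table could be set in application.yml roughly as follows. This is a sketch only: the directory path and the separator are placeholder values chosen for illustration, and the nesting simply follows the dotted property names listed above.

```yaml
# Hypothetical values for illustration; only the property names come from the table above
baseDirectory: /path/to/harvested/repositories   # placeholder directory searched for pipeline.json
oaiPmh:
  concatSeparator: '; '        # assumed separator; leave unset (null) to disable concatenation
  metadataParsingDefaultLang:
    lang: en                   # fallback language when @xml:lang is not present
```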
Elasticsearch properties are configured under the elasticsearch key.
elasticsearch:
  host: localhost # The Elasticsearch host
  username: elastic # The username to use when connecting to a secured Elasticsearch cluster
  password: examplePassword # The password to use when connecting to a secured Elasticsearch cluster
  numberOfShards: 2 # The number of primary shards the created indices will have
  numberOfReplicas: 0 # The number of replicas each primary shard has
The languages that the OSMH indexer will attempt to harvest are specified under languages. These languages will be parsed and indexed into Elasticsearch. The default languages are specified below.
languages: ['cs', 'da', 'de', 'el', 'en', 'et', 'fi', 'fr', 'hu', 'it', 'nl', 'no', 'pt', 'sk', 'sl', 'sr', 'sv']
Custom mappings and settings can be defined in src/main/resources/elasticsearch. Mappings are global for all defined languages, whereas settings are selected per language. If the required mappings and settings can't be loaded, the index will not be created and an error will be logged.
In most cases, repositories to index are detected using instances of pipeline.json. These are generated by the CESSDA Metadata Harvester and contain all the information needed to index the XMLs present alongside them.
Repositories are discovered by searching for instances of pipeline.json in the baseDirectory. The baseDirectory can be specified using the --baseDirectory command line parameter, or by specifying baseDirectory in application.yml.
Repositories can also be declared manually in application.yml, under the key endpoints.repos.
endpoints:
  repos:
    - url: http://194.117.18.18:6003/v0/oai
      path: path/to/directory
      code: APIS
      name: 'Portuguese Archive of Social Information (APIS)'
      preferredMetadataParam: oai_ddi25
      defaultLanguage: pt
Property | Type | Description |
---|---|---|
url | URI | Location of the OAI-PMH endpoint. |
code | String | Short name of the repository, acts as a unique identifier. This is a mandatory field. |
name | String | The friendly name of the repository, displayed in the user interface. This falls back to using code if null. |
path | Path | Location of the XML source files to be indexed. This is a mandatory parameter. |
preferredMetadataParam | String | The metadata prefix used when harvesting from the OAI-PMH repository. |
defaultLanguage | String | Used to set a language on an element that doesn't have @xml:lang defined. Defaults to oaiPmh.metadataParsingDefaultLang.lang if not set. This setting is only considered if oaiPmh.metadataParsingDefaultLang.active is set to true. |
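To show how the per-repository defaultLanguage interacts with the global fallback, here is a minimal sketch. The repository code, path and language values are hypothetical, and it assumes that a definition containing only the fields marked mandatory above is accepted.

```yaml
# Hypothetical repository definition; EXAMPLE and the path are placeholders
oaiPmh:
  metadataParsingDefaultLang:
    active: true               # defaultLanguage is only considered when this is true
    lang: en                   # global fallback language
endpoints:
  repos:
    - code: EXAMPLE            # mandatory; also used as the display name because name is omitted
      path: path/to/example    # mandatory location of the XML source files
      defaultLanguage: de      # overrides the global fallback for this repository only
```

Under these assumptions, elements in the EXAMPLE repository without @xml:lang would be treated as German, while repositories that do not set defaultLanguage would fall back to English.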
In DDI-C 2.5, Data Access is primarily read from /codeBook/stdyDscr/dataAccs/useStmt/conditions by checking for the values in the Access Rights CV, but free text values are also supported through a mappings JSON file. Mappings for each repository can be specified in data_access_mappings.json by stating which XPath to use from XPaths.java and which free text values to map to Open or Restricted. Any new XPath that is not already used for Data Access by some repository must also be added to parseDataAccess in CMMStudyMapper.java.
Repository names in the mappings JSON should be the same as the code set in the harvesting configuration (which follows the configuration from cessda.cdc.aggregator.deploy).
- Maven - Dependency Management
Please read Contributing to CESSDA Open Source Software for information on contributing to CESSDA software.
You can find the list of all contributors here.
This project is licensed under the Apache 2 Licence - see the LICENSE file for details.