
Parses DDI XML and converts it into CMM Metadata. Part of the OSMH harvester


cessda/cessda.cdc.osmh-indexer.cmm


OSMH Consumer Indexer (PaSC-OCI)

CESSDA CDC Consumer Indexer (an OSMH Consumer) for metadata harvesting and ingestion into Elasticsearch. See the OSMH System Architecture Document for more information about the Open Source Metadata Harvester (OSMH).

Quality - Software Maturity Level

The overall Software Maturity Level for this product and the individual scores for each attribute can be found in the SML file.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

The following tools must be installed before compiling:

  • Java JDK 17

Test it

./mvnw clean test

Sonar it

To perform SonarQube analysis locally, run SonarQube and then execute

./mvnw sonar:sonar

Build it

./mvnw clean package

Run it

./mvnw spring-boot:run

Run it — with a specified profile

To run the OSMH consumer with a custom profile, use the following command line:

java -jar target/pasc-oci*.jar --spring.profiles.active=${profile_name}

If no profile is specified, the default profile will be used. This profile is configured to use a local Elasticsearch instance hosted at http://localhost:9200.

Notes

  • Makes use of TDD
  • When running integration tests, a standalone Elasticsearch server is launched

Deployment

At startup

The application loads configuration from the following sources, listed from highest to lowest precedence as defined by Spring Boot.

  • Command line parameters
    • e.g. --logging.level.ROOT=DEBUG sets logging level for all classes
  • Environment Variables e.g. SECURITY_USER_NAME
    • Spring uses relaxed binding to convert environment variables into Java properties
    • e.g. SPRING_BOOT_ADMIN_USERNAME converts to spring.boot.admin.username
  • application-[dev,local,prod].yml
  • application.yml

See https://docs.spring.io/spring-boot/docs/3.1.x/reference/html/features.html#features.external-config for detailed documentation.
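
The environment variable example above can be illustrated with a minimal sketch. It is a deliberately simplified approximation of the naming rule (lower-case the name, replace underscores with dots) and is not Spring Boot's actual implementation, which handles further cases such as dashes and list indices. The class and method names are hypothetical.

```java
import java.util.Locale;

class RelaxedBindingSketch {

    // Simplified illustration only: lower-case the name and turn underscores into dots.
    // Spring Boot's real relaxed binding covers more cases than this.
    public static String toPropertyName(String envVarName) {
        return envVarName.toLowerCase(Locale.ROOT).replace('_', '.');
    }

    public static void main(String[] args) {
        // prints spring.boot.admin.username
        System.out.println(toPropertyName("SPRING_BOOT_ADMIN_USERNAME"));
    }
}
```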

At Runtime

If the application is registered at a Spring Boot Admin server, all environment properties can be changed at runtime.

Changes made at runtime take effect after a context reload, but are lost after an application restart unless they are persisted in application.yml.

Configuring the indexer

The OSMH indexer has many settings that change the behaviour of the indexing process.

| Property | Type | Description |
|---|---|---|
| baseDirectory | Path | Directory to search for pipeline.json repository definitions. |
| languages | Languages | The languages for which Elasticsearch indices will be created. |
| repos | List | Manually configured repository definitions. |
| oaiPmh.concatSeparator | String | The string used to concatenate repeated elements. Concatenation is disabled if null. |
| oaiPmh.metadataParsingDefaultLang.lang | String | The language to fall back to if @xml:lang is not present. Individual repositories can override this setting. |

Elasticsearch Properties

Elasticsearch properties are configured under the elasticsearch key.

elasticsearch:
  host: localhost # The Elasticsearch host
  username: elastic # The username to use when connecting to a secured Elasticsearch cluster
  password: examplePassword # The password to use when connecting to a secured Elasticsearch cluster
  numberOfShards: 2 # The number of primary shards the created indices will have
  numberOfReplicas: 0 # The number of replicas each primary shard has

Language settings

The languages that the OSMH indexer will attempt to harvest are specified under languages. These languages will be parsed and indexed into Elasticsearch. The default languages are specified below.

languages: ['cs', 'da', 'de', 'el', 'en', 'et', 'fi', 'fr', 'hu', 'it', 'nl', 'no', 'pt', 'sk', 'sl', 'sr', 'sv']

Custom mappings and settings can be defined in src/main/resources/elasticsearch. Mappings are global for all defined languages, whereas settings are selected per language. If the required mappings and settings can't be loaded, the index will not be created and an error will be logged.

Indexing a repository

In most cases, repositories to index are detected using instances of pipeline.json. These files are generated by the CESSDA Metadata Harvester and contain all the information needed to index the XML files present alongside them.

Repositories are discovered by searching for instances of pipeline.json in the baseDirectory. The baseDirectory can be specified using the --baseDirectory command line parameter, or by specifying baseDirectory in application.yml.
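
As an illustration of this discovery step, the following sketch walks a base directory and collects every file named pipeline.json. It is not the indexer's own code: the class and method names are hypothetical, and searching subdirectories recursively is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class PipelineDiscoverySketch {

    // Walks baseDirectory recursively (an assumption) and returns every
    // regular file named pipeline.json, sorted by path.
    public static List<Path> findPipelineFiles(Path baseDirectory) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDirectory)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equals("pipeline.json"))
                .sorted()
                .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical directory layout, created only for this demonstration.
        Path base = Files.createTempDirectory("osmh-demo");
        Path repo = Files.createDirectories(base.resolve("repoA"));
        Files.writeString(repo.resolve("pipeline.json"), "{}");
        Files.writeString(repo.resolve("study1.xml"), "<codeBook/>");

        // Prints the path of the single pipeline.json that was found.
        findPipelineFiles(base).forEach(System.out::println);
    }
}
```

Each directory containing a pipeline.json would then be treated as one repository, with the XML files alongside it as the content to index.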

Explicitly declaring a repository

Repositories are declared in application.yml and are specified under the key endpoints.repos.

endpoints:
  repos:
    - url: http://194.117.18.18:6003/v0/oai
      path: path/to/directory
      code: APIS
      name: 'Portuguese Archive of Social Information (APIS)'
      preferredMetadataParam: oai_ddi25
      defaultLanguage: pt

| Property | Type | Description |
|---|---|---|
| url | URI | Location of the OAI-PMH endpoint. |
| code | String | Short name of the repository, which acts as a unique identifier. This is a mandatory field. |
| name | String | The friendly name of the repository, displayed in the user interface. Falls back to code if null. |
| path | Path | Location of the XML source files to be indexed. This is a mandatory parameter. |
| preferredMetadataParam | String | The metadata prefix used when harvesting from the OAI-PMH repository. |
| defaultLanguage | String | The language to set on an element that doesn't have @xml:lang defined. Defaults to oaiPmh.metadataParsingDefaultLang.lang if not set. This setting is only considered if oaiPmh.metadataParsingDefaultLang.active is set to true. |
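
The fallback rules for defaultLanguage can be sketched as follows. This is an illustration of the behaviour described in the table, not the indexer's implementation. The class, method and parameter names are hypothetical, and returning an empty result when the fallback is inactive is an assumption.

```java
import java.util.Optional;

class DefaultLanguageSketch {

    // Resolves the language of an element:
    // 1. an explicit @xml:lang always wins,
    // 2. otherwise, if the fallback is active, the repository's defaultLanguage is used,
    // 3. otherwise the global oaiPmh.metadataParsingDefaultLang.lang value is used.
    // Assumption: when the fallback is inactive, no language is assigned.
    public static Optional<String> resolveLanguage(String xmlLang,
                                                   String repoDefaultLanguage,
                                                   String globalDefaultLanguage,
                                                   boolean fallbackActive) {
        if (xmlLang != null && !xmlLang.isBlank()) {
            return Optional.of(xmlLang);
        }
        if (!fallbackActive) {
            return Optional.empty();
        }
        if (repoDefaultLanguage != null) {
            return Optional.of(repoDefaultLanguage);
        }
        return Optional.ofNullable(globalDefaultLanguage);
    }

    public static void main(String[] args) {
        System.out.println(resolveLanguage("fi", "pt", "en", true).orElse("none"));  // prints fi
        System.out.println(resolveLanguage(null, "pt", "en", true).orElse("none"));  // prints pt
        System.out.println(resolveLanguage(null, null, "en", true).orElse("none"));  // prints en
        System.out.println(resolveLanguage(null, "pt", "en", false).orElse("none")); // prints none
    }
}
```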

Data Access mappings

Data Access is primarily read from the DDI-C 2.5 element /codeBook/stdyDscr/dataAccs/useStmt/conditions by checking for values from the Access Rights CV. Free text values are also supported through a mappings JSON file. Mappings for each repository are specified in data_access_mappings.json, which names the XPath to use from XPaths.java and the free texts to map to Open or Restricted. Any XPath that is not already used for Data Access by some repository must also be added to parseDataAccess in CMMStudyMapper.java.

Repository names in the mappings JSON should match the code set in the harvesting configuration (which follows the configuration from cessda.cdc.aggregator.deploy).
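
The free text mapping idea can be sketched as a lookup keyed by repository code. The structure of data_access_mappings.json is not shown in this document, so the in-memory map below is a hypothetical stand-in, and the free text strings are invented examples. The sketch also leaves out the XPath selection step.

```java
import java.util.Map;
import java.util.Optional;

class DataAccessMappingSketch {

    // Hypothetical stand-in for data_access_mappings.json:
    // repository code -> (free text -> Data Access category).
    // The free text values are invented for illustration.
    static final Map<String, Map<String, String>> MAPPINGS = Map.of(
        "APIS", Map.of(
            "Freely available", "Open",
            "Available to registered users", "Restricted"));

    // Returns the mapped category, or empty if the repository or free text is unknown.
    public static Optional<String> mapFreeText(String repositoryCode, String freeText) {
        Map<String, String> repoMappings = MAPPINGS.getOrDefault(repositoryCode, Map.of());
        return Optional.ofNullable(repoMappings.get(freeText.trim()));
    }

    public static void main(String[] args) {
        System.out.println(mapFreeText("APIS", "Freely available").orElse("unmapped"));  // prints Open
        System.out.println(mapFreeText("APIS", "Something else").orElse("unmapped"));    // prints unmapped
    }
}
```

Because the lookup is keyed by repository code, a mismatch between the name in the mappings JSON and the configured code would leave every free text value for that repository unmapped.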

Built With

  • Maven - Dependency Management

Contributing

Please read Contributing to CESSDA Open Source Software for information on contributing to CESSDA software.

Versioning

Authors

You can find the list of all contributors here.

License

This project is licensed under the Apache 2 Licence - see the LICENSE file for details.

Acknowledgments
