Merge branch 'develop' into 11044-edit-facet-bug
sekmiller committed Dec 2, 2024
2 parents ea97785 + 4624bb6 commit 301e918
Showing 15 changed files with 715 additions and 24 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/deploy_beta_testing.yml
@@ -5,6 +5,10 @@ on:
branches:
- develop

concurrency:
group: deploy-beta-testing
cancel-in-progress: false

jobs:
build:
runs-on: ubuntu-latest
5 changes: 5 additions & 0 deletions doc/release-notes/11049-oai-identifiers-as-pids.md
@@ -0,0 +1,5 @@
## When harvesting, Dataverse can now use the identifier from the OAI-PMH record header as the persistent id for the harvested dataset.

This will allow harvesting from sources that do not include a persistent id in their oai_dc metadata records, but use valid DOIs or handles as the OAI-PMH record header identifiers.

It is also possible to optionally configure a harvesting client to use this OAI-PMH identifier as the **preferred** choice for the persistent id. See the [Harvesting Clients API](https://guides.dataverse.org/en/6.5/api/native-api.html#create-a-harvesting-client) section of the Guides, #11049 and #10982 for more information.
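For reference, a typical OAI-PMH record header looks like the following (the identifier value here is hypothetical); it is the `<identifier>` element that can now serve as the dataset's persistent id:

    <header>
      <identifier>doi:10.5072/FK2/EXAMPLE</identifier>
      <datestamp>2024-12-02T00:00:00Z</datestamp>
      <setSpec>fooSet</setSpec>
    </header>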
2 changes: 2 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -5254,6 +5254,7 @@ Shows a Harvesting Client with a defined nickname::
"dataverseAlias": "fooData",
"nickName": "myClient",
"set": "fooSet",
"useOaiIdentifiersAsPids": false
"schedule": "none",
"status": "inActive",
"lastHarvest": "Thu Oct 13 14:48:57 EDT 2022",
@@ -5288,6 +5289,7 @@ The following optional fields are supported:
- style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
- customHeaders: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
- allowHarvestingMissingCVV: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of this writing there is no way to configure a harvesting schedule via this API.
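For example, here is a sketch of creating a client that prefers OAI identifiers as pids (using the create-client call linked above; the URLs and field values are hypothetical)::

  curl -H "X-Dataverse-key:$API_TOKEN" -X POST -H "Content-Type: application/json" \
    "$SERVER_URL/api/harvest/clients/myClient" -d '{
      "dataverseAlias": "fooData",
      "harvestUrl": "https://demo.example.edu/oai",
      "archiveUrl": "https://demo.example.edu",
      "metadataFormat": "oai_dc",
      "set": "fooSet",
      "useOaiIdentifiersAsPids": true
    }'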
14 changes: 7 additions & 7 deletions doc/sphinx-guides/source/user/appendix.rst
@@ -22,22 +22,22 @@ Supported Metadata

Detailed below are what metadata schemas we support for Citation and Domain Specific Metadata in the Dataverse Project:

- `Citation Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=0>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/citation.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, `DataCite 4.0 <https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf>`__, and Dublin Core's `DCMI Metadata Terms <https://dublincore.org/documents/dcmi-terms/>`__ . Language field uses `ISO 639-1 <https://www.loc.gov/standards/iso639-2/php/English_list.php>`__ controlled vocabulary.
- `Geospatial Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=4>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/geospatial.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, `DataCite 4.0 <https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf>`__, and Dublin Core. Country / Nation field uses `ISO 3166-1 <https://en.wikipedia.org/wiki/ISO_3166-1>`_ controlled vocabulary.
- `Social Science & Humanities Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=1>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/social_science.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, and Dublin Core.
- `Astronomy and Astrophysics Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=3>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/astrophysics.tsv>`__): These metadata elements can be mapped/exported to the International Virtual Observatory Alliance’s (IVOA)
- Citation Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/citation.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, `DataCite 4.0 <https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf>`__, and Dublin Core's `DCMI Metadata Terms <https://dublincore.org/documents/dcmi-terms/>`__ . Language field uses `ISO 639-1 <https://www.loc.gov/standards/iso639-2/php/English_list.php>`__ controlled vocabulary.
- Geospatial Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/geospatial.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, `DataCite 4.0 <https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf>`__, and Dublin Core. Country / Nation field uses `ISO 3166-1 <https://en.wikipedia.org/wiki/ISO_3166-1>`_ controlled vocabulary.
- Social Science & Humanities Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/social_science.tsv>`__): compliant with `DDI Lite <https://www.ddialliance.org/specification/ddi2.1/lite/index.html>`_, `DDI 2.5 Codebook <https://www.ddialliance.org/>`__, and Dublin Core.
- Astronomy and Astrophysics Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/astrophysics.tsv>`__): These metadata elements can be mapped/exported to the International Virtual Observatory Alliance’s (IVOA)
`VOResource Schema format <https://www.ivoa.net/documents/latest/RM.html>`__ and is based on
`Virtual Observatory (VO) Discovery and Provenance Metadata <https://perma.cc/H5ZJ-4KKY>`__.
- `Life Sciences Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=2>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__): based on `ISA-Tab Specification <https://isa-specs.readthedocs.io/en/latest/isamodel.html>`__, along with controlled vocabulary from subsets of the `OBI Ontology <https://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__.
- `Journal Metadata <https://docs.google.com/spreadsheets/d/13HP-jI_cwLDHBetn9UKTREPJ_F4iHdAvhjmlvmYdSSw/edit#gid=8>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/journals.tsv>`__): based on the `Journal Archiving and Interchange Tag Set, version 1.2 <https://jats.nlm.nih.gov/archiving/tag-library/1.2/chapter/how-to-read.html>`__.
- Life Sciences Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__): based on `ISA-Tab Specification <https://isa-specs.readthedocs.io/en/latest/isamodel.html>`__, along with controlled vocabulary from subsets of the `OBI Ontology <https://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__.
- Journal Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/journals.tsv>`__): based on the `Journal Archiving and Interchange Tag Set, version 1.2 <https://jats.nlm.nih.gov/archiving/tag-library/1.2/chapter/how-to-read.html>`__.

Experimental Metadata
~~~~~~~~~~~~~~~~~~~~~

Unlike supported metadata, experimental metadata is not enabled by default in a new Dataverse installation. Feedback via any `channel <https://dataverse.org/contact>`_ is welcome!

- `CodeMeta Software Metadata <https://docs.google.com/spreadsheets/d/e/2PACX-1vTE-aSW0J7UQ0prYq8rP_P_AWVtqhyv46aJu9uPszpa9_UuOWRsyFjbWFDnCd7us7PSIpW7Qg2KwZ8v/pub>`__: based on the `CodeMeta Software Metadata Schema, version 2.0 <https://codemeta.github.io/terms/>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/codemeta.tsv>`__)
- `Computational Workflow Metadata <https://docs.google.com/spreadsheets/d/13HP-jI_cwLDHBetn9UKTREPJ_F4iHdAvhjmlvmYdSSw/edit#gid=447508596>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/computational_workflow.tsv>`__): adapted from `Bioschemas Computational Workflow Profile, version 1.0 <https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE>`__ and `Codemeta <https://codemeta.github.io/terms/>`__.
- Computational Workflow Metadata (`see .tsv <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/computational_workflow.tsv>`__): adapted from `Bioschemas Computational Workflow Profile, version 1.0 <https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE>`__ and `Codemeta <https://codemeta.github.io/terms/>`__.

Please note: these custom metadata schemas are not included in the Solr schema for indexing by default; you will need
to add them as necessary for your custom metadata blocks. See "Update the Solr Schema" in :doc:`../admin/metadatacustomization`.
1 change: 1 addition & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataCitation.java
@@ -792,6 +792,7 @@ private GlobalId getPIDFrom(DatasetVersion dsv, DvObject dv) {
if (!dsv.getDataset().isHarvested()
|| HarvestingClient.HARVEST_STYLE_VDC.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_ICPSR.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DEFAULT.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DATAVERSE
.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())) {
if(!isDirect()) {
src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java
@@ -150,12 +150,16 @@ public DatasetDTO processXML( XMLStreamReader xmlr, ForeignMetadataFormatMapping

}

// Helper method for importing harvested Dublin Core xml.
// Helper methods for importing harvested Dublin Core xml.
// Dublin Core is considered a mandatory, built in metadata format mapping.
// It is distributed as required content, in reference_data.sql.
// Note that arbitrary formatting tags are supported for the outer xml
// wrapper. -- L.A. 4.5
public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException {
return processOAIDCxml(DcXmlToParse, null, false);
}

public DatasetDTO processOAIDCxml(String DcXmlToParse, String oaiIdentifier, boolean preferSuppliedIdentifier) throws XMLStreamException {
// look up DC metadata mapping:

ForeignMetadataFormatMapping dublinCoreMapping = findFormatMappingByName(DCTERMS);
@@ -185,18 +189,37 @@ public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException

datasetDTO.getDatasetVersion().setVersionState(DatasetVersion.VersionState.RELEASED);

// Our DC import handles the contents of the dc:identifier field
// as an "other id". In the context of OAI harvesting, we expect
// the identifier to be a global id, so we need to rearrange that:
// In some cases, the identifier that we want to use for the dataset is
// already supplied to the method explicitly. For example, in some
// harvesting cases we'll want to use the OAI identifier (the identifier
// from the <header> section of the OAI record) for that purpose, without
// expecting to find a valid persistent id in the body of the DC record:

String identifier = getOtherIdFromDTO(datasetDTO.getDatasetVersion());
logger.fine("Imported identifier: "+identifier);
String globalIdentifier;

String globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: "+globalIdentifier);
if (oaiIdentifier != null) {
logger.fine("Attempting to use " + oaiIdentifier + " as the persistentId of the imported dataset");

globalIdentifier = reassignIdentifierAsGlobalId(oaiIdentifier, datasetDTO);
} else {
// Our DC import handles the contents of the dc:identifier field
// as an "other id". Unless we are using an externally supplied
// global id, we will be using the first such "other id" that we
// can parse and recognize as the global id for the imported dataset
// (note that this is the default behavior during harvesting),
// so we need to reassign it accordingly:
String identifier = selectIdentifier(datasetDTO.getDatasetVersion(), oaiIdentifier, preferSuppliedIdentifier);
logger.fine("Imported identifier: " + identifier);

globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: " + globalIdentifier);
}

if (globalIdentifier == null) {
throw new EJBException("Failed to find a global identifier in the OAI_DC XML record.");
String exceptionMsg = oaiIdentifier == null ?
"Failed to find a global identifier in the OAI_DC XML record." :
"Failed to parse the supplied identifier as a valid Persistent Id";
throw new EJBException(exceptionMsg);
}

return datasetDTO;
@@ -344,8 +367,20 @@ private FieldDTO makeDTO(DatasetFieldType dataverseFieldType, FieldDTO value, St
return value;
}

private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
public String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier) {
return selectIdentifier(datasetVersionDTO, suppliedIdentifier, false);
}

private String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier, boolean preferSuppliedIdentifier) {
List<String> otherIds = new ArrayList<>();

if (suppliedIdentifier != null && preferSuppliedIdentifier) {
// This supplied identifier (in practice, this is likely the OAI-PMH
// identifier from the <record> <header> section) will be our first
// choice candidate for the pid of the imported dataset:
otherIds.add(suppliedIdentifier);
}

for (Map.Entry<String, MetadataBlockDTO> entry : datasetVersionDTO.getMetadataBlocks().entrySet()) {
String key = entry.getKey();
MetadataBlockDTO value = entry.getValue();
Expand All @@ -363,6 +398,16 @@ private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
}
}
}

if (suppliedIdentifier != null && !preferSuppliedIdentifier) {
// Unless specifically instructed to prefer this extra identifier
// (in practice, this is likely the OAI-PMH identifier from the
// <record> <header> section), we will try to use it as the *last*
// possible candidate for the pid, so, adding it to the end of the
// list:
otherIds.add(suppliedIdentifier);
}

if (!otherIds.isEmpty()) {
// We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F"
for (String otherId : otherIds) {
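To make the precedence concrete, here is a minimal, self-contained sketch of the same selection order (hypothetical class and helper names, not the actual Dataverse code, which also recognizes other persistent-id forms):

import java.util.ArrayList;
import java.util.List;

public class IdentifierPrecedenceSketch {
    // Returns the first candidate recognizable as a persistent id.
    // The supplied OAI identifier is tried first when preferred, last otherwise.
    static String selectIdentifier(List<String> dcOtherIds, String oaiIdentifier, boolean preferSupplied) {
        List<String> candidates = new ArrayList<>();
        if (oaiIdentifier != null && preferSupplied) {
            candidates.add(oaiIdentifier);   // first-choice candidate
        }
        candidates.addAll(dcOtherIds);       // ids parsed from <dc:identifier> entries
        if (oaiIdentifier != null && !preferSupplied) {
            candidates.add(oaiIdentifier);   // last-resort candidate
        }
        for (String id : candidates) {
            if (id.startsWith("doi:") || id.startsWith("hdl:")) {
                return id;                   // first recognizable persistent id wins
            }
        }
        return null;                         // caller raises an exception in this case
    }

    public static void main(String[] args) {
        List<String> fromDc = List.of("internal-id-42", "doi:10.7910/DVN/EXAMPLE");
        // DC record wins when the OAI id is not preferred:
        System.out.println(selectIdentifier(fromDc, "doi:10.5072/FK2/OAIHEADER", false));
        // OAI header id wins when preferred:
        System.out.println(selectIdentifier(fromDc, "doi:10.5072/FK2/OAIHEADER", true));
    }
}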
src/main/java/edu/harvard/iq/dataverse/api/imports/ImportServiceBean.java
Expand Up @@ -208,7 +208,13 @@ public JsonObjectBuilder handleFile(DataverseRequest dataverseRequest, Dataverse
}

@TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, HarvestingClient harvestingClient, String harvestIdentifier, String metadataFormat, File metadataFile, Date oaiDateStamp, PrintWriter cleanupLog) throws ImportException, IOException {
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest,
HarvestingClient harvestingClient,
String harvestIdentifier,
String metadataFormat,
File metadataFile,
Date oaiDateStamp,
PrintWriter cleanupLog) throws ImportException, IOException {
if (harvestingClient == null || harvestingClient.getDataverse() == null) {
throw new ImportException("importHarvestedDataset called with a null harvestingClient, or an invalid harvestingClient.");
}
@@ -244,8 +250,8 @@ public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, Harve
} else if ("dc".equalsIgnoreCase(metadataFormat) || "oai_dc".equals(metadataFormat)) {
logger.fine("importing DC "+metadataFile.getAbsolutePath());
try {
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse);
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse, harvestIdentifier, harvestingClient.isUseOaiIdentifiersAsPids());
} catch (IOException | XMLStreamException e) {
throw new ImportException("Failed to process Dublin Core XML record: "+ e.getClass() + " (" + e.getMessage() + ")");
}
src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClient.java
Expand Up @@ -252,8 +252,16 @@ public void setAllowHarvestingMissingCVV(boolean allowHarvestingMissingCVV) {
this.allowHarvestingMissingCVV = allowHarvestingMissingCVV;
}

// TODO: do we need "orphanRemoval=true"? -- L.A. 4.4
// TODO: should it be @OrderBy("startTime")? -- L.A. 4.4
private boolean useOaiIdAsPid;

public boolean isUseOaiIdentifiersAsPids() {
return useOaiIdAsPid;
}

public void setUseOaiIdentifiersAsPids(boolean useOaiIdAsPid) {
this.useOaiIdAsPid = useOaiIdAsPid;
}

@OneToMany(mappedBy="harvestingClient", cascade={CascadeType.REMOVE, CascadeType.MERGE, CascadeType.PERSIST})
@OrderBy("id")
private List<ClientHarvestRun> harvestHistory;
src/main/java/edu/harvard/iq/dataverse/util/json/JsonParser.java
Expand Up @@ -1052,6 +1052,7 @@ public String parseHarvestingClient(JsonObject obj, HarvestingClient harvestingC
harvestingClient.setHarvestingSet(obj.getString("set",null));
harvestingClient.setCustomHttpHeaders(obj.getString("customHeaders", null));
harvestingClient.setAllowHarvestingMissingCVV(obj.getBoolean("allowHarvestingMissingCVV", false));
harvestingClient.setUseOaiIdentifiersAsPids(obj.getBoolean("useOaiIdentifiersAsPids", false));

return dataverseAlias;
}