Collection: Keep track of list of issues that we want to address as part of 1.4.1 #25
Closed · 4 of 20 tasks
Labels: pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) · Project: NIH GREI (Tasks related to the NIH GREI project)
Comments
mreekie changed the title from PM.Feature: Harvesting Work for the NIH deliverable to PM.Epic: Harvesting Work for the NIH deliverable on May 9, 2022
mreekie changed the title from PM.Epic: Harvesting Work for the NIH deliverable to PM.Epic: Harvesting Implementation for the NIH deliverable on May 9, 2022
Not finished defining. It seems like some of this work may apply to other NIH objectives that Len has mentioned. Next steps:
mreekie changed the title from PM.Epic: Harvesting Implementation for the NIH deliverable to Collection: Keep track of list of issues that we want to address as part of 1.4.1 on Jan 9, 2023
Met with Leonid today.
mreekie added the pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) label on Mar 31, 2023
The first step was to figure out what has already been done by the Dataverse team and by the community toward this aim, and what still remains to be done. Leonid's conclusions at the end of Issue 8574 do that.
This Epic tracks the work as proposed by Leonid:
I do believe that the third item under the "definition of done" - "prioritize" - was the actual important part of this spike. I also believe that most of that effort of prioritizing what's important can only be done within the dev. team. I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
(Note that I'm interpreting the word "prioritizing" as assigning some order of importance to these issues and bugs, what makes sense to fix first and/or what's ready to be worked on vs. what needs more discussion; not as scheduling them for specific sprints, etc.!)
The single most important harvesting issue: (ok, maybe not the most important - but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; should be fairly easy to wrap up too)
The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation a single missing metadata export that's supposed to be cached is going to break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed, without needing to conduct any extra research first. Some of them may be VERY OLD; but they look like something we should fix.
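To make the failure mode above concrete (one missing cached export aborting an entire harvest run), the usual fix is a skip-and-log pattern. The function name and cache file layout below are hypothetical, not Dataverse's actual implementation; this is only a minimal sketch of the idea:

```python
import logging
from pathlib import Path

log = logging.getLogger("harvest")

def serve_harvest(export_dir, dataset_ids):
    """Serve cached metadata exports, skipping datasets whose cache file
    is missing instead of aborting the whole harvest run.
    Returns (served, skipped) lists of dataset ids.
    The "<id>.xml" cache layout is an illustrative assumption."""
    served, skipped = [], []
    for ds in dataset_ids:
        cached = Path(export_dir) / f"{ds}.xml"  # hypothetical cache path
        if not cached.is_file():
            # One bad record should not fail the entire run.
            log.warning("missing cached export for %s; skipping", ds)
            skipped.append(ds)
            continue
        served.append(ds)
    return served, skipped
```

The design point is simply that per-record failures are recorded and reported at the end, rather than raised mid-run.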
The following 3 issues are basically the same thing: people requesting extra ISO language codes be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different codes are being requested in each, but it makes sense to get all 3 out of the way at the same time:
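For context, each such request amounts to appending one row to the #controlledVocabulary section of citation.tsv. The sketch below assumes a (DatasetField, Value, identifier, displayOrder) column layout with an empty leading column, which follows the general shape of Dataverse metadata-block TSVs; check the actual file's header before relying on it, and the example values are invented:

```python
import csv

def add_language_value(tsv_path, value, identifier, display_order):
    """Append a controlled-vocabulary row for the 'language' field to a
    metadata-block TSV. Column layout is an assumption, not taken from
    the real citation.tsv."""
    with open(tsv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["", "language", value, identifier, str(display_order)])
```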
The following issues are about the DDI exporter producing XML that is not valid under the schema.
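Validating exports against the DDI codebook XSD requires the schema file and a schema-aware parser such as lxml, but even a standard-library well-formedness check catches one class of exporter bugs early. A minimal sketch (the sample records in the usage below are invented):

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if the export parses as XML, else
    (False, error message). This checks well-formedness only;
    DDI schema validation additionally needs the XSD and a
    validator such as lxml's XMLSchema."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)
```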
Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?
The following issues are proposed changes to the design of the harvesting framework and/or metadata exports. Meaning this is something we probably need to discuss as a team, before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):
(ordered by importance, IMO)
There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:
It is obviously super important and should be prioritized too; but it's more of a design discussion and research, rather than something we already can start coding.
The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.
More Background:
This is in support of:
an NIH grant "The Harvard Dataverse repository: A generalist repository integrated with a Data Commons",
Aim 4: Improve harvesting and packaging standards to share metadata and data across repositories,
There is a lot packaged into Aim 4
Improved Harvesting via the OAI-PMH standard
Improved support for Bagit
Improved support for Signposting
The scope for this issue is Harvesting via the OAI-PMH standard
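As a reminder of what OAI-PMH harvesting looks like on the wire: a harvest is a plain HTTP GET with a verb parameter, and the response is XML in the http://www.openarchives.org/OAI/2.0/ namespace. The verb, parameter names, and namespace below come from the OAI-PMH 2.0 spec; the endpoint URL is a placeholder:

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL per OAI-PMH 2.0."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # When resuming, the token must be the only argument besides verb.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def record_identifiers(response_xml):
    """Extract record identifiers from a ListRecords response body."""
    root = ET.fromstring(response_xml)
    return [header.findtext(f"{OAI_NS}identifier")
            for header in root.iter(f"{OAI_NS}header")]
```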
Aim 4:
Improve harvesting and packaging standards to share metadata and data across repositories
Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.
A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.
To help with this, **we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.**
Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard for moving metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.
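As background on what a Bag actually is: the BagIt spec (RFC 8493) requires little more than a bagit.txt declaration, a data/ payload directory, and a payload manifest of checksums. The sketch below builds that minimal structure with the standard library; production code would use a maintained library such as the Library of Congress's bagit-python instead:

```python
import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir, payload):
    """Create a minimal BagIt bag: bagit.txt, data/ payload files, and a
    manifest-sha256.txt with one checksum line per payload file.
    `payload` maps relative file names to bytes."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    manifest_lines = []
    for name, content in payload.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag
```

A receiving repository verifies the bag by recomputing each payload checksum against the manifest, which is what makes Bags suitable for moving datasets between repositories without silent loss.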
Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to more accurately and completely represent the content in the Harvard Dataverse repository.
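Concretely, Signposting is just typed links in the HTTP Link header (RFC 8288), using relations such as cite-as (the persistent identifier) and describedby (machine-readable metadata). The dict keys and URLs below are invented placeholders, not Dataverse's API:

```python
def signposting_link_header(landing):
    """Build a Signposting-style HTTP Link header value for a dataset
    landing page. `landing` is a dict with hypothetical keys 'doi' and
    'metadata'; the rel values follow the Signposting pattern."""
    links = []
    if "doi" in landing:
        links.append(f'<{landing["doi"]}>; rel="cite-as"')
    for meta in landing.get("metadata", []):
        links.append(
            f'<{meta["url"]}>; rel="describedby"; type="{meta["type"]}"')
    return ", ".join(links)
```

A crawler that understands Signposting can then follow the describedby link to a metadata export without scraping the HTML landing page.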
Related documents
Notes on Dataverse Deliverables for NIH OTA
NIH OTA Progress Notes
NIH OTA
Exposing and harvesting metadata using the OAI metadata harvesting protocol: A tutorial (2001)
Getting Started with BagIt in 2018
bagit from Library of Congress video