Collection: Keep track of list of issues that we want to address as part of 1.4.1 #25
Closed · 4 of 20 tasks
Labels: pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) · Project: NIH GREI (Tasks related to the NIH GREI project)
Comments
mreekie changed the title from PM.Feature: Harvesting Work for the NIH deliverable to PM.Epic: Harvesting Work for the NIH deliverable on May 9, 2022
mreekie changed the title from PM.Epic: Harvesting Work for the NIH deliverable to PM.Epic: Harvesting Implementation for the NIH deliverable on May 9, 2022
Not finished defining. It seems like some of this work may apply to other NIH objectives that Len has mentioned. Next steps:
mreekie changed the title from PM.Epic: Harvesting Implementation for the NIH deliverable to Collection: Keep track of list of issues that we want to address as part of 1.4.1 on Jan 9, 2023
Met with Leonid today.
mreekie added the pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) label on Mar 31, 2023
The first step was to figure out what has already been done by the Dataverse team and by the community toward this aim, and what still remains to be done. Leonid's conclusions at the end of Issue 8574 do that.
This Epic tracks the work as proposed by Leonid:
I do believe that the third item under the "definition of done" - "prioritize" - was the actual important part of this spike. I also believe that most of that effort of prioritizing what's important can only be done within the dev. team. I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
(Note that I'm interpreting the word "prioritizing" as assigning some order of importance to these issues and bugs, what makes sense to fix first and/or what's ready to be worked on vs. what needs more discussion; not as scheduling them for specific sprints, etc.!)
The single most important harvesting issue: (ok, maybe not the most important - but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; should be fairly easy to wrap up too)
The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation a single missing metadata export that's supposed to be cached is going to break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed, without needing to conduct any extra research first. Some of them may be VERY OLD; but they look like something we should fix.
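To make the failure mode above concrete (one missing cached export aborting an entire harvest run), the usual fix is a skip-and-log pattern. The function name and cache file layout below are hypothetical, not Dataverse's actual implementation; this is only a minimal sketch of the idea:

```python
import logging
from pathlib import Path

log = logging.getLogger("harvest")

def serve_harvest(export_dir, dataset_ids):
    """Serve cached metadata exports, skipping datasets whose cache file
    is missing instead of aborting the whole harvest run.
    Returns (served, skipped) lists of dataset ids.
    The "<id>.xml" cache layout is an illustrative assumption."""
    served, skipped = [], []
    for ds in dataset_ids:
        cached = Path(export_dir) / f"{ds}.xml"  # hypothetical cache path
        if not cached.is_file():
            # One bad record should not fail the entire run.
            log.warning("missing cached export for %s; skipping", ds)
            skipped.append(ds)
            continue
        served.append(ds)
    return served, skipped
```

The design point is simply that per-record failures are recorded and reported at the end, rather than raised mid-run.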
The following 3 issues are basically the same thing: people requesting extra ISO language codes be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different codes are being requested in each, but it makes sense to get all 3 out of the way at the same time:
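For context, each such request amounts to appending one row to the #controlledVocabulary section of citation.tsv. The sketch below assumes a (DatasetField, Value, identifier, displayOrder) column layout with an empty leading column, which follows the general shape of Dataverse metadata-block TSVs; check the actual file's header before relying on it, and the example values are invented:

```python
import csv

def add_language_value(tsv_path, value, identifier, display_order):
    """Append a controlled-vocabulary row for the 'language' field to a
    metadata-block TSV. Column layout is an assumption, not taken from
    the real citation.tsv."""
    with open(tsv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["", "language", value, identifier, str(display_order)])
```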
The following issues are about the DDI exporter producing XML that is not valid under the schema.
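Validating exports against the DDI codebook XSD requires the schema file and a schema-aware parser such as lxml, but even a standard-library well-formedness check catches one class of exporter bugs early. A minimal sketch (the sample records in the usage below are invented):

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if the export parses as XML, else
    (False, error message). This checks well-formedness only;
    DDI schema validation additionally needs the XSD and a
    validator such as lxml's XMLSchema."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)
```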
Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?
The following issues are proposed changes to the design of the harvesting framework and/or metadata exports. Meaning this is something we probably need to discuss as a team, before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):
(ordered by importance, IMO)
There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:
It is obviously super important and should be prioritized too; but it's more of a design discussion and research, rather than something we already can start coding.
The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.
More Background:
This is in support of:
an NIH grant "The Harvard Dataverse repository: A generalist repository integrated with a Data Commons",
Aim 4: Improve harvesting and packaging standards to share metadata and data across repositories,
There is a lot packaged into Aim 4
Improved Harvesting via the OAI-PMH standard
Improved support for Bagit
Improved support for Signposting
The scope for this issue is Harvesting via the OAI-PMH standard
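As a reminder of what OAI-PMH harvesting looks like on the wire: a harvest is a plain HTTP GET with a verb parameter, and the response is XML in the http://www.openarchives.org/OAI/2.0/ namespace. The verb, parameter names, and namespace below come from the OAI-PMH 2.0 spec; the endpoint URL is a placeholder:

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL per OAI-PMH 2.0."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # When resuming, the token must be the only argument besides verb.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def record_identifiers(response_xml):
    """Extract record identifiers from a ListRecords response body."""
    root = ET.fromstring(response_xml)
    return [header.findtext(f"{OAI_NS}identifier")
            for header in root.iter(f"{OAI_NS}header")]
```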
Aim 4:
Improve harvesting and packaging standards to share metadata and data across repositories
Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.
A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.
To help with this, **we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.**
Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard for moving metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.
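As background on what a Bag actually is: the BagIt spec (RFC 8493) requires little more than a bagit.txt declaration, a data/ payload directory, and a payload manifest of checksums. The sketch below builds that minimal structure with the standard library; production code would use a maintained library such as the Library of Congress's bagit-python instead:

```python
import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir, payload):
    """Create a minimal BagIt bag: bagit.txt, data/ payload files, and a
    manifest-sha256.txt with one checksum line per payload file.
    `payload` maps relative file names to bytes."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    manifest_lines = []
    for name, content in payload.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag
```

A receiving repository verifies the bag by recomputing each payload checksum against the manifest, which is what makes Bags suitable for moving datasets between repositories without silent loss.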
Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to more accurately and completely represent the content in the Harvard Dataverse repository.
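Concretely, Signposting is just typed links in the HTTP Link header (RFC 8288), using relations such as cite-as (the persistent identifier) and describedby (machine-readable metadata). The dict keys and URLs below are invented placeholders, not Dataverse's API:

```python
def signposting_link_header(landing):
    """Build a Signposting-style HTTP Link header value for a dataset
    landing page. `landing` is a dict with hypothetical keys 'doi' and
    'metadata'; the rel values follow the Signposting pattern."""
    links = []
    if "doi" in landing:
        links.append(f'<{landing["doi"]}>; rel="cite-as"')
    for meta in landing.get("metadata", []):
        links.append(
            f'<{meta["url"]}>; rel="describedby"; type="{meta["type"]}"')
    return ", ".join(links)
```

A crawler that understands Signposting can then follow the describedby link to a metadata export without scraping the HTML landing page.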
Related documents
Notes on Dataverse Deliverables for NIH OTA
NIH OTA Progress Notes
NIH OTA
Exposing and harvesting metadata using the OAI metadata harvesting protocol: A tutorial (2001)
Getting Started with BagIt in 2018
bagit from Library of Congress video