Align or merge DataCite metadata exports #5889
Comments
Related? Silent publishing failure when not all fields required by Datacite are present #7551 |
Good point. It could be related if/when Dataverse repositories start sending more metadata to DataCite and the dependencies among the child fields of any of that metadata are the same as the dependencies among the child fields of the Producer compound field (which right now is the only field causing those silent failures). |
@qqmyers and I are also looking at this given that what we're currently sending to DataCite is indeed rather inadequate. I'm not at all concerned about the naming algorithm. If anything, I think it's a good idea to try to guess organizational names. Given this, I think a single export format makes sense. In terms of items missing from both exports, the citation metadata looks complete, but the individual subject blocks seem to have some stuff missing. From @philippconzett's list at #7072, that's most notably the geography data, which we'd also like to capture. We're viewing this as pretty high priority given how widely DataCite data are used (e.g. the fact that we're not linking up our funding information to the PID graph isn't great) -- is there anything we can do to help move this along? |
Thanks @adam3smith. @jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think? |
Is my #7077 related here, too? (Going to work on that, you folks know... Funding...) |
@poikilotherm May be related, but I think we'd want to move these forward independently IMHO. I think much of the discussion around #7077 will happen as part of the Software Metadata WG. |
Awesome! @jggautier -- I think you have this covered, but if there's anything you'd like another set of eyes on or a 2nd opinion just tag and/or email me. |
Thanks @adam3smith. Great to hear there's more interest in prioritizing this! I'm all on board with saving the closed vs. restricted data categorization problem (#5920) for another day if it moves this issue forward. I think there are a few other things we should consider:
|
Thanks Julian.
|
Users can set
|
Thanks! Dataverse currently doesn't have a nameType option, which is why we need some sort of algorithmic solution to determine this.
Since the name list also sounds like it works less well for non-Western names, I'd actually now be somewhat nervous about this. Do you have contacts at some of the Chinese DV installations we could ask or are there Dataverse Collections at Harvard more likely to contain non-Western creator names so we could check? If this is indeed fairly common, labeling a significant number of people with non-Western names as institutions seems a lot more problematic than the reverse and I'd go back on my opinion above... |
The presence of a comma is unfortunately not a good heuristic for DataCite, as many repositories use "givenName familyName", instead of "familyName, givenName". The best solution is really using givenName and familyName. The reason we use a name dictionary is mainly that adoption of givenName/familyName is too low. |
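To make the failure mode concrete, here is a minimal sketch comparing a comma-only check with one that relies on explicit name parts. Both function names are hypothetical and chosen for illustration; this is not the Dataverse or DataCite implementation, and the sample names are made up.

```python
from typing import Optional


def guess_name_type_by_comma(name: str) -> str:
    """Naive heuristic (assumption): any name containing a comma is a person."""
    return "Personal" if "," in name else "Organizational"


def name_type_from_parts(given_name: Optional[str],
                         family_name: Optional[str]) -> Optional[str]:
    """Return "Personal" only when explicit name parts exist; otherwise make no guess."""
    if given_name and family_name:
        return "Personal"
    return None


# "familyName, givenName" is classified as a person ...
print(guess_name_type_by_comma("Doe, Jane"))
# ... but "givenName familyName" is mislabeled as an organization,
print(guess_name_type_by_comma("Jane Doe"))
# and an organization name that happens to contain a comma is mislabeled as a person.
print(guess_name_type_by_comma("Example Institute, Department of Examples"))
```

With separate givenName and familyName values no guessing is needed, which is why the comment above calls that the best solution.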
Just to be clear -- what we're after here is not to change what Datacite does but what Dataverse does in creating metadata submitted to Datacite -- Datacite just comes in because the Dataverse algorithm for handling names is derived from your code. I think adding separate name fields would be quite challenging at this point, though I agree that it'd be much preferable. |
I understand. One important reason for "guessing" personal names is citation styles and formatted citations (as you of course know). DataCite introduced givenName and familyName a few years ago and it is still optional as it is indeed challenging to implement. |
Thanks @mfenner as always!
@adam3smith, there was a lot of discussion in #4257 about figuring out the nameType and adapting DataCite's algorithm to address failure cases discovered during QA. I think the summary at #4257 (comment) still holds, and it includes looking for an ORCID, but I don't think we looked at how well the algorithm works for non-Western names. I think we could contact folks from installations where non-Western names are common, possibly ones running Dataverse 4.14+, and could look at Dataverse Collections at Harvard that are more likely to contain non-Western creator names.
Maybe the outcome of this investigation would be to figure out whether or not we need to make it possible/easier for installations to turn off the nameType algorithm for the DataCite export. @adam3smith, @qqmyers, @djbrooke, how does that sound? And work to figure out how Dataverse repositories can better determine nameType can be done as part of another issue?
@adam3smith wrote: "There are a number of the fields that only make sense as conditionals (e.g. all the scheme/identifier fields). The solution described in #7606 looks good to me and would appear to solve this and seems to be scheduled to land in 5.4?"
I agree and spoke with @scolapasta about the use cases and limits of #7606. My understanding is that it wouldn't address cases like the Related Publication field. @scolapasta could confirm, but from what I understand #7606 wouldn't let repositories say that if the ID Type is filled, the ID Number must also be filled (or vice versa), because that compound field also has two other fields, "Citation" and "URL", which I think should remain optional for the purposes of exporting metadata in the DataCite schema.
The code for the OpenAIRE export already handles Related Publication in a different way, only including that metadata if both ID Type and ID Number are filled (instead of taking the approach of #7606 to prompt depositors to enter the metadata the way the software/installation admins expect). I'm not sure if there are other fields to consider, but I don't think looking out for these cases will make this issue take any longer to work on.
@djbrooke wrote: "@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?"
Based on all of this I'm thinking two things should be done before this is ready for implementation, and I'd have time in the next two weeks to help do them:
* a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
* a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about whether a creator is a person or an organization)
Then maybe we could aim for working on implementation in the following sprint? What do you all think? |
One small comment: the author of the library we use for names (https://github.com/berkmancenter/namae) is @inukshuk who @adam3smith knows from citationstyles work, maybe it is worth reaching out to him, e.g. to ask about handling of non-Western names. |
One quick thought: It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way. (Could even make it something that could be optional if admins don’t think it works well for their installations.)
…-- Jim
|
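The Related Publication rule described earlier in this thread (export the identifier only when both ID Type and ID Number are filled, while Citation and URL stay optional) can be sketched as follows. This is an illustration under assumptions, not the actual OpenAIRE exporter code: the function name and the input keys `idType`, `idNumber`, `citation` and `url` are hypothetical.

```python
from typing import Optional


def related_identifier(publication: dict) -> Optional[dict]:
    """Build a relatedIdentifier entry only if ID Type and ID Number are both present.

    The input keys are hypothetical stand-ins for the child fields of the
    Related Publication compound field.
    """
    id_type = (publication.get("idType") or "").strip()
    id_number = (publication.get("idNumber") or "").strip()
    if not (id_type and id_number):
        # Incomplete pair: skip the element instead of exporting invalid metadata.
        return None
    # Citation and URL are optional and do not affect whether the entry is exported.
    return {"relatedIdentifierType": id_type, "relatedIdentifier": id_number}


complete = {"idType": "doi", "idNumber": "10.1234/example", "citation": "Doe (2020)"}
missing_number = {"idType": "doi", "url": "https://example.org/paper"}

print(related_identifier(complete))
print(related_identifier(missing_number))
```

Filtering at export time keeps the exported record valid without forcing depositors to fill in both fields, which is the difference from the validation approach in #7606.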
That sounds good to me.
We'd be happy with this -- the more control we have over metadata the better -- but there may be concern about too many UI elements for self-deposit repositories. |
Sorry for joining the discussion so late. I just want to add a reference to the in-progress update to the OpenAIRE DataArchive guidelines, which will be based on version 4 of the DataCite schema: https://openaire-guidelines-for-data-archive-managers.readthedocs.io/en/latest/index.html This is essentially the new version of the guidelines that we were asked to develop for in 2018 (to be more specific, at that time we looked at DataCite schema v4.1), and that work was contributed to Dataverse in 4.14. The OpenAIRE team is still working on the new version. I'm taking the liberty of pinging them on this thread, openaire/guidelines-data-archives#2, so that they will be aware of the work in progress in the Dataverse community. |
Hi @abollini. I don't think you're late at all. The status of this issue was brought up in a recent Dataverse community meeting, so I thought it would be helpful to write here that the plans being discussed in this GitHub issue for how to proceed haven't been started or finalized. I think it's great that the OpenAIRE team will be aware of this discussion. Thanks! |
Just noticed that in the DataCite exports of installations running Dataverse software v5.9, and maybe all earlier versions, parentheses are added to the Author Affiliation values that are put in DataCite's creator > affiliation element: The screenshot is from an export from Demo Dataverse, running v5.9. It's also done in this export from DataverseNL (v5.9). Maybe this is because the code is getting what's displayed on the dataset page instead of what's entered in the field on the edit metadata page? Looks like that was the issue when Author Affiliation values were wrapped in parentheses in the search API results (#6570 (comment)). The OpenAIRE export doesn't include the parentheses, so I mention this bug in this issue since it seems natural that merging these two exports, or aligning them more, would also fix this parentheses bug. |
Thanks @sbarbosadataverse. I don't have any objections to this being prioritized and moved to sprint ready. I'm worried we won't hear back from folks from OpenAIRE by the end of the sprint next Wednesday. I'll reach out to @abollini again in openaire/guidelines-data-archives#2. @poikilotherm, I'm hesitant to try to better understand what generators are. But could you write about the benefits? For example, does it make it easier to change the exporters? |
Currently, for DataCite we use a template approach, combined with XML processing. For DDI we use, AFAIK, an XML-only processing approach. For our JSON-based exports we use mostly JSON processing. The point is: all of this is handcrafted. The implementation is done by us and we need to make sure the serialized output matches the specifications involved. We also provide the mapping from our internal model to the target model with these serializers.
When using generators, parts of the process are turned upside down. You start with the spec (XML XSD, JSON Schema, OpenAPI...) and you use a tool to generate model classes from it. The result is a set of classes that can be serialized to the target output format using the data binding mechanisms included in the Jakarta standards. Beyond that, these classes can also be used for the inverse process: deserialization from some data to the model. An example would be importing DataCite XML from OAI-PMH: use the data binding to get a populated Java model of the data.
As the model classes are generated from the spec, they are known to cover all of the spec. We might not use all of the available modeling, but at least we can extend it without much hassle. As long as the generator tools don't make mistakes, the data binding will always produce valid output data and will always map correct input data back to the model. Using our own implementations for de-/serialization requires extensive testing and also a lot of manual work to implement every change.
The availability of schemas and model classes for them allows much stricter enforcement of data validity at compile time and at runtime. Constraints about the data from the spec are carried into the data model, allowing for simpler interaction with the model from code, with the Java compiler assisting you as you build it.
For the exporters, having schemas around (and I'm talking about more than just DataCite) will also allow for a more clearly defined data exchange between the core application and plugged-in exporters. The model classes provide Data Transfer Objects as a side product. Also, upgrading schemas is improved. We can include a generated data model version for any version of a schema. If we want to change the supported schema version, the Java code can help us determine what to change and how. It's much clearer in code what is supported and what isn't. Changing a version means changing the import path for the model classes. Brain dump out. |
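The round trip described above (model class to XML and back) can be mimicked in a few lines. In Java the model classes would be generated from the XSD and bound with Jakarta's data binding; the Python sketch below only illustrates the idea, with a hand-written dataclass standing in for a generated class. The class and function names are hypothetical, and XML namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Creator:
    """Stand-in for a class a generator would derive from the schema (assumption)."""
    creator_name: str
    name_type: Optional[str] = None  # e.g. "Personal" or "Organizational"


def to_xml(creator: Creator) -> str:
    """Serialize the model object to a <creator> element."""
    root = ET.Element("creator")
    name_el = ET.SubElement(root, "creatorName")
    name_el.text = creator.creator_name
    if creator.name_type is not None:
        name_el.set("nameType", creator.name_type)
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text: str) -> Creator:
    """Deserialize a <creator> element back into the model object."""
    name_el = ET.fromstring(xml_text).find("creatorName")
    if name_el is None or not name_el.text:
        raise ValueError("creatorName is required")
    return Creator(creator_name=name_el.text, name_type=name_el.get("nameType"))


original = Creator("Doe, Jane", "Personal")
xml_text = to_xml(original)
print(xml_text)
print(from_xml(xml_text) == original)
```

The same model object serves export and import, which is the property the comment attributes to generated classes; with a real generator the mapping code would not be written by hand.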
To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'. If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment. |
This issue has an open PR... ... so I'm reopening it. It'll be closed when we merge it. |
We're now using this PR instead to close this issue: |
Thanks for the heads up @pdurbin. I'm going to keep this issue open, or I guess re-open it after that PR is merged, so that I can see what decisions were made and what goals and questions aren't addressed yet. |
@jggautier sounds good. Perhaps we can create a new issue with any remaining items. |
This issue is meant to record the differences between Dataverse's two newest metadata exports as of v4.14, "DataCite"/"Datacite" and "OpenAIRE"/"oai_datacite", and discussion about how to align (or possibly merge) the very similar exports.
As part of v4.10 (released in Dec. 2018), Dataverse makes available through the UI, API and over OAI-PMH dataset metadata in the DataCite schema (#5043). This lets Dataverse export dataset metadata in a widely-used, discipline-agnostic schema that's more standardized than Schema.org and has more metadata than Dublin Core.
As part of v4.14 (released in May 2019), Dataverse makes available through the UI, API and over OAI-PMH DataCite metadata that complies with OpenAIRE requirements (#4257). Repositories need to follow these requirements in order for their dataset metadata to be made discoverable (harvested) by OpenAIRE (OpenAIRE EXPLORE). The OpenAIRE metadata requirements follow the DataCite schema, with some differences between OpenAIRE and DataCite listed in their documentation.
What both exports are called depending on the export method:
Both metadata exports are based on DataCite 4 and are meant to be valid against the DataCite 4 schema (although the XML records available over OAI-PMH in "Datacite" format reference DataCite's 3.1 schema). But Dataverse exports them as separate formats for several reasons:
Ideally, Dataverse would export only one metadata record, made available through the UI, API and over OAI-PMH, that follows the DataCite schema and is also OpenAIRE compliant. The way things are now, with Dataverse exporting two different metadata records that are both based on DataCite, people have been confused about the differences between the two exports, which are called "DataCite" and "OpenAIRE" in the UI and "Datacite" and "oai_datacite" in the API endpoints and over OAI-PMH.
But we may want to maintain two metadata exports because:
We should decide if: