Make Dataverse produce valid DDI codebook 2.5 XML #3648
Thanks @jomtov for moving this issue from our support system! I thought it might be helpful to give some background on the issue, list what might need to change when the DDI xml is made valid, and describe the errors. As background for anyone else interested, the DDI xml that Dataverse generates for each dataset (and datafile) needs to follow DDI's schema, so that other repositories and applications using DDI xml can use it (e.g. during harvesting). To answer jomtov's question, I think Dataverse's xml would need to be corrected. After fixing the errors and making sure the XML is valid, these are what I imagine will need to be adjusted:
There are five errors here, described in the dataverse_1062_philipsonErrorTypes.txt file in jomtov's post:
1. DDI schema doesn't like "DVN" as a value for source in
2. DDI schema doesn't like the URI attribute being called "URI":
As jomtov points out, the keyword URI is called vocabURI in Dataverse. Unless there's a reason why it's called URI in the DDI XML, I think this is as easy as changing "URI" to "vocabURI", which is okay with the schema.
3. DDI schema doesn't like where "contact" info is placed:
The DDI schema says that sumDscr shouldn't hold things like contact info. The contact element should be under useStmt:
4 and 5. DDI schema doesn't like
Two of the elements that can be nested under
Lastly, this isn't one of the five errors reported, but DDI likes
There may be more validation errors (since these two datasets have only some of all possible metadata). @raprasad and I talked yesterday about trying to validate all (or a greater number?) of Harvard Dataverse's DDI XML to find additional errors and make sure the DDI XML is always valid. There was also some discussion about when and how Dataverse validates the DDI it generates, and making sure that process is working.
@jomtov would you be able to tell us what tools you're using to validate against a DDI 2.5 schema? I documented how to validate against DDI 2.0 using MSV (Multi Schema Validator) at http://guides.dataverse.org/en/4.6/developers/tools.html#msv but I seem to recall that DDI 2.5 is more complicated and requires multiple schema files or something. I don't think I ever figured out how to use MSV to validate DDI 2.5. Do you use some other tool? Any tips for me? Thanks!
@pdurbin, I used the schema found in the schemaLocation of the exported xml-files of the item examples above:
The DDI 2.5 test fails with this: `src-resolve: Cannot resolve the name 'xml:lang' to a(n) 'attribute declaration' component.` We should be exporting valid DDI 2.5.
Ah, thanks @jomtov. Judging from its Wikipedia page, the Oxygen XML Editor is not free and open source. Bummer. In a491cd9 I just pushed some code to demonstrate the difficulty I've seen in validating against that The failing Travis build from that commit at demonstrates the error I'm seeing:
That's from https://travis-ci.org/IQSS/dataverse/builds/208627544#L3805 Does anyone have any idea how to fix this test? Here's the line that's failing:
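A note on the `src-resolve` error quoted above: one common cause of this message (not confirmed as the cause in this thread) is that the schema being compiled imports the `xml` namespace from a relative `schemaLocation`, and the compiler is handed a bare stream with no system ID, so it has no base URI for finding the imported file. The sketch below reproduces that situation with two tiny stand-in schemas. `MAIN_XSD`, `XML_XSD`, and the class and method names are hypothetical and only illustrate the mechanism; they are not the real codebook.xsd or Dataverse code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XmlLangImportDemo {
    // Hypothetical stand-in for codebook.xsd: imports the xml namespace via a relative path.
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:import namespace='http://www.w3.org/XML/1998/namespace' schemaLocation='xml.xsd'/>"
        + "<xs:element name='titl'><xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'><xs:attribute ref='xml:lang'/></xs:extension>"
        + "</xs:simpleContent></xs:complexType></xs:element></xs:schema>";

    // Hypothetical stand-in for the imported xml.xsd; it declares only xml:lang.
    static final String XML_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='http://www.w3.org/XML/1998/namespace'>"
        + "<xs:attribute name='lang' type='xs:language'/></xs:schema>";

    /** Writes both schemas into a fresh temp directory and returns the path of main.xsd. */
    static Path writeSchemas() {
        try {
            Path dir = Files.createTempDirectory("ddi-xsd");
            Files.write(dir.resolve("xml.xsd"), XML_XSD.getBytes(StandardCharsets.UTF_8));
            Path mainXsd = dir.resolve("main.xsd");
            Files.write(mainXsd, MAIN_XSD.getBytes(StandardCharsets.UTF_8));
            return mainXsd;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compiles from a bare stream (no system ID). Returns null on success, else the error text. */
    static String loadFromStream(Path mainXsd) {
        try {
            byte[] bytes = Files.readAllBytes(mainXsd);
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new ByteArrayInputStream(bytes)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compiles from a File, which carries a base URI. Returns null on success, else the error text. */
    static String loadFromFile(Path mainXsd) {
        try {
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(mainXsd.toFile());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        Path mainXsd = writeSchemas();
        System.out.println("from stream: " + loadFromStream(mainXsd));
        System.out.println("from file:   " + loadFromFile(mainXsd));
    }
}
```

If the same mechanism is at work in the failing test, passing a `File`, a `URL`, or a `StreamSource` with `setSystemId(...)` would give the compiler a base for resolving the import. Registering an `LSResourceResolver` on the `SchemaFactory` is another option when the schemas live on the classpath.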
Well, @pdurbin, https://www.corefiling.com/opensource/schemaValidate.html (also on GitHub) is a free xml validator online that seems to work anyway. I uploaded the codebook.xsd and one of the erroneous export-items from above and validated - here attached as .txt -files, since .xsd and .xml are not supported by GitHub, to be 'reconverted' again before use: True, the validator did not find some of the other referenced schemas, but they are not relevant here, and all the specific codebook.xsd validation errors seem to be identified anyway (scrolling down in the results):
Maybe this could be useful?
@jomtov thanks for the pointer to https://www.corefiling.com/opensource/schemaValidate.html which I just tried. It seems to work great. It's perfect for one-off validation of an XML file against a schema. To be clear, what I was trying to say in #3648 (comment) is that I'd like to teach Dataverse itself to validate XML against a schema. It works for DDI 2.0 but not DDI 2.5. I still don't understand why. For the Java developers reading this, a491cd9 is the commit I made the other day.
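For readers wondering what "teaching Dataverse itself to validate" could look like, here is a minimal sketch using the JDK's built-in `javax.xml.validation` API. It is not the code from a491cd9. The inline schema is a hypothetical two-element stand-in (a `keyword` that allows `vocabURI` but not `URI`, echoing error 2 above), and `CollectingValidator` is an illustrative name. The point is the `ErrorHandler`, which records every violation with its line number instead of stopping at the first one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectingValidator {
    // Hypothetical miniature schema; the real codebook.xsd is far larger.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='codeBook'><xs:complexType><xs:sequence>"
        + "<xs:element name='titl' type='xs:string'/>"
        + "<xs:element name='keyword' maxOccurs='unbounded'><xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'><xs:attribute name='vocabURI' type='xs:anyURI'/>"
        + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates xml against xsd and returns every problem found (empty list means valid). */
    static List<String> validate(String xsd, String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    problems.add("warning, line " + e.getLineNumber() + ": " + e.getMessage());
                }
                @Override public void error(SAXParseException e) {
                    // Recoverable schema violation: record it and keep validating.
                    problems.add("error, line " + e.getLineNumber() + ": " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // e.g. malformed XML: nothing more can be checked
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        String invalid = "<codeBook><titl>t</titl>"
            + "<keyword URI='http://example.org/v'>a</keyword>"
            + "<keyword URI='http://example.org/v'>b</keyword></codeBook>";
        validate(XSD, invalid).forEach(System.out::println);
    }
}
```

Collecting all errors in one pass gives output closer to what the online validator reports, which is more useful for triaging an export than a single exception.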
Hi @jomtov. Here's the corrected DDI xml for the first dataset: valid_DDIXMLforItem1.zip. At first I misinterpreted the errors you posted, but I've got it down now. It's valid as far as I can tell. The online tool you mentioned keeps timing out for me. When you get the chance, could you check to see if the corrected DDI xml is valid with the tool you use? A while back @pdurbin posted a DDI xml file for a dataset with most of the metadata fields that Dataverse exports. That file and the corrected file (validated with "topic classification" included) are here: invalid_and_valid_DDIxml.zip. Most of the corrections were just moving elements around in the xml, but some involved changing which fields the elements go into (e.g. CC0, or what's entered into Terms of Use if CC0 isn't chosen, can't go into
I'd like to rename this issue to something like "Make Dataverse produce valid DDI codebook 2.5 xml", which would involve "teaching Dataverse itself to validate" DDI xml against the codebook 2.5 schema.
@jomtov are you ok with renaming this issue as @jggautier suggests?
@pdurbin and @jggautier, yes, I am OK with the renaming suggested. (Sorry for the belated answer, I've been on vacation and offline for a while.) Keep up the good work!
The xml files in my earlier comment (ZIP file) don't have most of the metadata in the Terms tab, so the corrections don't take that metadata into account. Current exported DDI from Dataverse has most of the Terms metadata in the right DDI element, but just in the wrong place in the xml. The exception is the Terms of Access metadata field - whatever's entered there is exported to DDI's
I wrote a doc describing what I think are most of the mapping changes needed: https://drive.google.com/open?id=1ICXRL8DP5fCGYiRyRphh_3OotNaWOOak1VmnyufBNsM I'm pointing our ADA friends to this issue and doc, especially the part about the Terms metadata, since I think the invalid mapping has complicated their own work mapping ADA's DDI to Dataverse's for their planned migration.
I rewrote the XML validator in Dataverse and now have a test to validate XML we send to DataCite (it operates on a static file) and I added a FIXME to use the validator with DDI as well:
…a violations *for the control dataset we are using in our tests*. There is almost certainly more that needs to be done. #3648
Just to clarify a couple of things from an earlier discussion:
"Looking specifically at what has been reported" may not easily apply. This is a very old issue, with a lot of back-and-forth (that's very hard to read), and many of the things reported earlier have already been fixed in other PRs. So I assumed that the goal of the PR was "make Dataverse produce valid DDI". (i.e., if something not explicitly mentioned here is obviously failing validation, it needed to be fixed too - it did not make sense to make a PR that would fix some things, but still produce ddi records that fail validation; especially since people have been waiting for it to be fixed since 2017). The previously discussed automatic validation - adding code to the exporter that would validate in real time every ddi record produced, and only cache it if it passes the validation - does make sense to be left as a separate sprint-sized task. (the validation itself is not hard to add; but we'll need to figure out how to report the errors). I have enabled the validation test in To clarify, in the current state, the exporter in my branch is producing valid ddi xml for our control "all fields" dataset, plus all the other datasets used in our tests, and whatever I could think of to test. It does NOT guarantee that there is no possible scenario where it can still output something illegal! So, yes, it is important to add auto-validation. And, if and when somebody finds another such scenario, we will treat it as a new issue. A couple of arbitrary decisions had to be made. I will spell it out in the PR description. My general approach was, if something does not translate from our metadata to the ddi format 1:1, just drop it and move on. We don't assume that it's a goal to preserve all of our metadata when exporting DC; it's obvious that only a subset of our block fields can be exported in that format. But it's not a possibility with the ddi either, now that we have multiple blocks and the application is no longer centered around quantitative social science.
So, no need to sweat a lost individual field here and there.
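The "validate every record, and only cache it if it passes" idea described in the comment above could be sketched roughly as follows. This is an assumption-laden illustration, not the exporter's actual code: `cacheIfValid`, `tinySchema`, and the one-element schema are hypothetical names and data, and error reporting is reduced to a line on stderr because the thread leaves that question open.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class ValidatingCacheWriter {
    /** Hypothetical one-element schema standing in for the DDI codebook schema. */
    static Schema tinySchema() {
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='codeBook'><xs:complexType><xs:sequence>"
            + "<xs:element name='titl' type='xs:string'/>"
            + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    /** Writes exportedXml to cacheFile only if it validates; returns whether it was cached. */
    static boolean cacheIfValid(Schema schema, byte[] exportedXml, Path cacheFile) {
        try {
            // With no ErrorHandler set, the validator throws on the first violation.
            schema.newValidator().validate(new StreamSource(new ByteArrayInputStream(exportedXml)));
        } catch (SAXException e) {
            System.err.println("export rejected, not cached: " + e.getMessage());
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            Files.write(cacheFile, exportedXml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

A compiled `Schema` is thread-safe and can be built once and shared, whereas each `Validator` should be used by one thread at a time, which is why the sketch creates a new validator per call.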
To check compatibility I use the following two validators:
@kaczmirek
…he API), and the corresponding control ddi export. #3648
…uide importable, using kcondon's fixes. #3648
…made multiple in PR #9254; would be great to put together a process for developers who need to make changes to fields in metadata blocks that would help them to know of all the places where changes like this need to be made. (not the first time, when something breaks, in ddi export specifically, after a field is made multiple). #3648
Forwarded from the ticket:
https://help.hmdc.harvard.edu/Ticket/Display.html?id=245607
Hello,
I tried to validate two items exported to DDI from dataverse.harvard.edu with codebook.xsd (2.5) and got the same types of validation errors described below for item1 (below the line, should work as a well-formed xml-file):
Item 1: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI
Item 2: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/P4JTOD
What could be done about it (other than meddling with the schema)?
Best regards,
Joakim Philipson
Research Data Analyst, Ph.D., MLIS
Stockholm University Library
Stockholm University
SE-106 91 Stockholm
Sweden
Tel: +46-8-16 29 50
Mobile: +46-72-1464702
E-mail: [email protected]
http://orcid.org/0000-0001-5699-994X
<xs:attribute name="source" default="producer">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="archive"/>
<xs:enumeration value="producer"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
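The fragment above appears to be the declaration behind the first error type: `source` is restricted to the tokens `archive` and `producer`, so a value such as "DVN" is rejected. The sketch below wraps that fragment in a minimal schema to show the effect. The enclosing element name `item` and the class `SourceEnumDemo` are hypothetical; in codebook.xsd the attribute is attached to real DDI elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SourceEnumDemo {
    // The attribute declaration is the one quoted in the ticket; <item> is a made-up host element.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='item'><xs:complexType>"
        + "<xs:attribute name='source' default='producer'>"
        + "<xs:simpleType><xs:restriction base='xs:NMTOKEN'>"
        + "<xs:enumeration value='archive'/>"
        + "<xs:enumeration value='producer'/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:attribute>"
        + "</xs:complexType></xs:element></xs:schema>";

    /** Returns null if xml is valid against XSD, otherwise the first validation message. */
    static String check(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("source='DVN':     " + check("<item source='DVN'/>"));
        System.out.println("source='archive': " + check("<item source='archive'/>"));
        System.out.println("no source:        " + check("<item/>"));
    }
}
```

Because the attribute has `default="producer"`, omitting it is also valid, so an exporter could either emit one of the two allowed values or leave the attribute out.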
dataverse_1062_philipsonErrorTypes.txt