bklog: Deliverable - As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API #26

Closed
poikilotherm opened this issue Aug 10, 2020 · 25 comments
Labels
bklog: Deliverable D: ImproveJsonValidation https://github.com/IQSS/dataverse-pm/issues/26

Comments

@poikilotherm

poikilotherm commented Aug 10, 2020

As a "bklog: Deliverable", this issue is decomposed into smaller issues.

  • Each of the smaller issues gets the label "D: ImproveJsonValidation".
  • This issue, the only issue to have both labels (bklog: Deliverable, D: ImproveJsonValidation), will stay in the Dataverse_Global_Backlog project forever.
  • It will stay in its present column until the team feels that no further smaller issues need to be broken off in order to resolve it.
  • At that point, this issue stays in the Dataverse_Global_Backlog project, but changes its status to "Clear of the Backlog".

This is related to AUSSDA/pyDataverse#48 and to dvcli as a CLI tool for Dataverse. (Tagging @skasberger here)
On the Dataverse side of life, this is related to the almighty IQSS/dataverse#6030 and loosely coupled to IQSS/dataverse#4451 (which might make the creation of the schema easier).

When creating a new dataset via the web UI, you get a nice interface and validation before the dataset is created. What is required, what is available as a field, etc. is nicely integrated into the UI for both users and curators.

However, this is not the case when uploading new datasets via API. Before you send a JSON representation, there is no way to validate the dataset in terms of metadata schemas, required fields, etc.

It would be nice to provide an API endpoint to retrieve a JSON Schema for a given Dataverse that contains precise constraints and requirements: what your dataset JSON has to look like and what other fields might be available.

This is useful not only for pre-creation validation, but also for automatic creation of options on command lines (think autocompletion, ncurses interfaces, ...) or client-side forms.
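
To make the pre-creation validation idea concrete, here is a minimal sketch in Python. It is not how Dataverse does or would do it: `GENERAL_SCHEMA` is a hypothetical, heavily trimmed schema, and `validate` is a hand-rolled checker that only understands the JSON Schema keywords `type`, `required`, `properties` and `items`. A real client would more likely fetch the schema from the instance and hand it to a full JSON Schema library.

```python
import json

# Hypothetical, heavily trimmed schema for the overall shape of a dataset JSON.
GENERAL_SCHEMA = {
    "type": "object",
    "required": ["datasetVersion"],
    "properties": {
        "datasetVersion": {
            "type": "object",
            "required": ["metadataBlocks"],
            "properties": {"metadataBlocks": {"type": "object"}},
        }
    },
}

# Only the JSON types this sketch needs.
_TYPES = {"object": dict, "array": list, "string": str, "boolean": bool}


def validate(instance, schema, path="$"):
    """Return a list of error messages (empty list means valid).

    Supports only the keywords type, required, properties and items.
    """
    errors = []
    expected = schema.get("type")
    if expected and not isinstance(instance, _TYPES[expected]):
        return [f"{path}: expected {expected}"]
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


good = json.loads('{"datasetVersion": {"metadataBlocks": {"citation": {"fields": []}}}}')
bad = json.loads('{"datasetVersion": {}}')

print(validate(good, GENERAL_SCHEMA))  # []
print(validate(bad, GENERAL_SCHEMA))
```

Because the errors carry a path into the document, the same output could drive a CLI message or highlight a form field on the client side.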

@poikilotherm
Author

The almighty @pdurbin has found a very closely related issue: IQSS/dataverse#3060

@poikilotherm
Author

poikilotherm commented Jan 20, 2022

While writing the HERMES concept paper, we once more stumbled over this. I learned that Zenodo is offering such a schema at https://zenodo.org/schemas/deposits/records/legacyrecord.json

A recent talk with @atrisovic also revealed that this would be very useful for creating a CodeMeta <-> Dataverse JSON crosswalk (IQSS/dataverse#7844).

Tagging this as @hermes-hmc related

@djbrooke

Thanks @poikilotherm - this has come up as we've discussed integration with some Harvard library systems as well. PRs welcome if you have the availability to work on this.

@skasberger

skasberger commented May 11, 2022

This is a central part of a future release, as mapping between different schemas and data types seems to be of high relevance for the future. You can find out more about my thoughts and work on this in pyDataverse here: gdcc/pyDataverse#102

To move on, it would totally make sense to connect such activity, to create a common understanding of data structures, validation processes and mappings.

@pdurbin
Member

pdurbin commented Jul 13, 2022

Quoting @4tikhonov at https://groups.google.com/g/dataverse-community/c/TqXmICwr0io/m/qEZDvPwuAAAJ

"There is Dataverse schema in .nt format, you can easily get it in the knowledge graph with rdflib and serialise as json-ld:
https://github.com/Dans-labs/semaf-client/tree/master/schema "

@pdurbin
Member

pdurbin commented Dec 7, 2022

@mreekie
Collaborator

mreekie commented Feb 27, 2023

There are at least 2 subtasks

  • Create a JSON schema
  • Create the code to validate the JSON schema
  • Size is unknown so we'll make this a deliverable and break off smaller issues.
  • Next step: Discuss at the next tech hour.

@poikilotherm
Author

poikilotherm commented Mar 8, 2023

I just had a lovely chat with @JR-1991 while he was still in Boston about this (and other things).

For easyDataverse, he would welcome having

  1. one general JSON Schema that describes and validates the general parts of a Dataset JSON and its overall structure, without looking at the semantics of the metadata, and
  2. a downloadable JSON Schema per collection that also contains in-depth validators for the metadata fields (which depend on the collection's configuration). He will be generating the necessary classes for this on-the-fly anyway.

The general schema might need to be versioned, so downloadable via the instance, too.
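
As a rough illustration of the per-collection idea, the sketch below generates a schema from a list of required field names per metadata block. Everything here is an assumption for the sake of the example: the function name `collection_schema`, the input mapping, and the choice to express "this field must be present" with the standard JSON Schema keywords `allOf`, `contains` and `const`. It is not taken from any Dataverse code.

```python
import json


def collection_schema(required_fields):
    """Build a JSON Schema requiring certain field typeNames per metadata block.

    required_fields maps a block name to the typeNames that must appear, e.g.
    {"citation": ["title", "author"]}. The mapping is illustrative input, not
    something read from a real Dataverse instance.
    """
    blocks = {}
    for block, type_names in required_fields.items():
        fields = {"type": "array"}
        if type_names:
            # One "contains" clause per required field: some entry in the
            # array must carry exactly this typeName.
            fields["allOf"] = [
                {
                    "contains": {
                        "type": "object",
                        "required": ["typeName"],
                        "properties": {"typeName": {"const": name}},
                    }
                }
                for name in type_names
            ]
        blocks[block] = {
            "type": "object",
            "required": ["fields"],
            "properties": {"fields": fields},
        }
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "required": ["datasetVersion"],
        "properties": {
            "datasetVersion": {
                "type": "object",
                "required": ["metadataBlocks"],
                "properties": {
                    "metadataBlocks": {
                        "type": "object",
                        "required": sorted(blocks),
                        "properties": blocks,
                    }
                },
            }
        },
    }


schema = collection_schema({"citation": ["title", "author"]})
print(json.dumps(schema, indent=2))
```

The outer part (everything above `metadataBlocks`) is what a general, versioned schema could cover on its own; only the `blocks` portion depends on the collection.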

NOTE: It might help adding the latest version to https://www.schemastore.org to have autocomplete when writing JSON manually.

@qqmyers
Member

qqmyers commented Mar 8, 2023

FWIW: The current json has a lot of schema-ish info in it already - would it make sense to remove that and make it flatter like the json-ld format? More work up front perhaps (and a breaking change) but a simpler schema and more readability.
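
To show what "flatter" could mean in practice, here is a small sketch that collapses the verbose field entries (with `typeName`, `multiple`, `typeClass`, `value`) into plain key/value pairs. The target shape is only my guess at a flatter format, and `flatten_fields` is a hypothetical helper, not an existing API.

```python
def flatten_fields(fields):
    """Collapse verbose field entries into a plain {typeName: value} mapping."""
    flat = {}
    for field in fields:
        value = field["value"]
        if isinstance(value, dict):
            # Single compound value: subfields keyed by name.
            value = flatten_fields(value.values())
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # Repeated compound value: a list of subfield mappings.
            value = [flatten_fields(item.values()) for item in value]
        flat[field["typeName"]] = value
    return flat


verbose = [
    {"typeName": "title", "multiple": False, "typeClass": "primitive", "value": "Sample"},
    {
        "typeName": "author",
        "multiple": True,
        "typeClass": "compound",
        "value": [
            {
                "authorName": {
                    "typeName": "authorName",
                    "multiple": False,
                    "typeClass": "primitive",
                    "value": "Doe, Jane",
                }
            }
        ],
    },
]

print(flatten_fields(verbose))
# {'title': 'Sample', 'author': [{'authorName': 'Doe, Jane'}]}
```

The information dropped here (`multiple`, `typeClass`) is exactly the part that a schema could carry instead.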

@poikilotherm
Author

poikilotherm commented Mar 8, 2023

Yes, absolutely! We talked about that, too, but I forgot. We think alike: meta information about the document's structure should be in a schema, not in a dataset's JSON.

We could make it somewhat non-breaking by accepting it as deprecated input for a while in the schema, but ignoring it in the parser that translates to DTO/POJO.
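
A sketch of that "accept but ignore" behaviour, written in Python for brevity even though the real parser would be Java. `FieldDTO`, `parse_field` and the set of deprecated keys are all assumptions made up for this example.

```python
import warnings
from dataclasses import dataclass
from typing import Any

# Assumption: these are the structural keys that would move into the schema.
DEPRECATED_KEYS = {"multiple", "typeClass"}
KNOWN_KEYS = {"typeName", "value"}


@dataclass
class FieldDTO:
    type_name: str
    value: Any


def parse_field(raw):
    """Build a FieldDTO, tolerating but discarding deprecated structural keys."""
    ignored = raw.keys() & DEPRECATED_KEYS
    if ignored:
        warnings.warn(
            f"ignoring deprecated keys {sorted(ignored)} in field {raw.get('typeName')!r}",
            DeprecationWarning,
            stacklevel=2,
        )
    unknown = raw.keys() - DEPRECATED_KEYS - KNOWN_KEYS
    if unknown:
        # Anything else is still an error, so typos do not pass silently.
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    return FieldDTO(type_name=raw["typeName"], value=raw["value"])


old_style = {"typeName": "title", "multiple": False, "typeClass": "primitive", "value": "Sample"}
new_style = {"typeName": "title", "value": "Sample"}

# Both inputs yield the same DTO; only the old style triggers a warning.
print(parse_field(old_style) == parse_field(new_style))
```

Old clients keep working during the deprecation window, while new clients can already send the leaner form.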

@qqmyers
Member

qqmyers commented Mar 8, 2023

This could potentially sync the processing of the json and json-ld - e.g. if the json-ld can be made to look just like the json with an added @context which I think could be possible.
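
Roughly, the idea could look like the following sketch: the flat JSON stays untouched and a JSON-LD `@context` is added that maps each key to a term URI. The two Dublin Core term URIs are real, but whether they are the right mapping for these fields is an assumption; a real context would presumably be derived from the metadata block definitions. `to_json_ld` is a hypothetical helper.

```python
import json

# Assumed term mapping for illustration only.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "subject": "http://purl.org/dc/terms/subject",
}


def to_json_ld(flat):
    """Return a JSON-LD document: the flat JSON plus an @context for its keys."""
    context = {key: CONTEXT[key] for key in flat if key in CONTEXT}
    return {"@context": context, **flat}


flat = {"title": "Sample", "subject": ["Physics"]}
doc = to_json_ld(flat)
print(json.dumps(doc, indent=2))

# Dropping @context gives back the plain JSON, so one parser could serve both.
plain = {k: v for k, v in doc.items() if k != "@context"}
print(plain == flat)  # True
```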

@beepsoft

beepsoft commented Mar 9, 2023

I don't know if it helps, but you may also consider CEDAR as an example. Their metadata schema (https://more.metadatacenter.org/tools-training/outreach/cedar-template-model) is described as a JSON Schema, which defines the structure of a JSON-LD document, which should be the actual filled metadata.

@pdurbin
Member

pdurbin commented Mar 9, 2023

@beepsoft you're actually using CEDAR with Dataverse, right? I forget, did you give us a sample we can play with? 😄

@beepsoft

beepsoft commented Mar 9, 2023

@pdurbin Yes I am. :-) As part of this we now have or will have soon the following tools:

  • CEDAR template JSON Schema import as Dataverse metadata block
  • Dataverse mdb export back to TSV
  • Dataverse mdb export to CEDAR template JSON Schema (via the TSV format)

We use these to keep DV metadata blocks and CEDAR templates in sync.

I'm not sure if I gave you an example CEDAR template, but here is one, for example:

https://openview.metadatacenter.org/templates/https:%2F%2Frepo.metadatacenter.org%2Ftemplates%2Fdc3fa214-88f4-49dd-b56b-f4552b2d3474

At the end of the page you can find "Advanced View", there you can copy the JSON Schema.

For more examples, you can easily register at

https://cedar.metadatacenter.org/

with github or ORCID account, and then can take a look at the public templates:

https://cedar.metadatacenter.org/dashboard?sharing=shared-with-everybody&folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2F6a419e8c-7f49-4426-9c8d-320de6ec061f

This is my usual test template:

CEDAR-NCBI Human Tissue

https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/dc3fa214-88f4-49dd-b56b-f4552b2d3474?sharing=shared-with-everybody&folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2F6a419e8c-7f49-4426-9c8d-320de6ec061f

Once you go to the editor, at the bottom you can always see the JSON Schema associated with the template.

@mreekie
Collaborator

mreekie commented Mar 9, 2023

Sizing:

  • People have been asking for this for a long time.

There are at least 2 subtasks

  • Create a JSON schema
  • Create the code to validate the JSON schema
  • Size is unknown so we'll make this a deliverable and break off smaller issues.
  • Next step: Discuss at the next tech hour.

Sizing:

  • We need @poikilotherm in on a discussion and sizing.
  • Discussed today quickly:
    • We do already have a JSON schema library that knows how to validate a schema. Used in prov.json
    • The gist of this is about creating the JSON schema.
    • Once we have the schema we can use it in multiple places.
    • This is specifically about our internal JSON schema.

@poikilotherm
Author

* We need @poikilotherm in on a discussion and sizing.

😉

  * We do already have a JSON schema library that knows how to validate a schema. Used in prov.json

A comment: Yes, we have. It's old and outdated and should be replaced. Can be done in the same go.

I'm 65% sure that library is also not capable of creating a schema, just do validation of JSON with a given schema. So we might need to write out the schema using some JSON-P.

@pdurbin
Member

pdurbin commented Mar 9, 2023

@beepsoft thanks! Interesting!

@mreekie
Collaborator

mreekie commented Mar 14, 2023

Sizing:

  • This is going to need to be a parent issue with smaller steps

Steps:

  • Agree on general structure - create a schema that matches our dataset JSON without adding too many specifics on the metadata. Check the structure. (no code) (10)
  • Use this schema to update the code. We currently have a library that validates JSON against a schema (used for the W3C provenance schema). As part of this solution we should not feel limited to using our current library. (On the page about the JSON schema there are multiple Java library options available for validation.) (10)
  • Create API endpoints to retrieve a JSON schema that also has info about the required metadata so that we can create a full-fledged validation of a dataset. That API endpoint has a parameter that is the Dataverse collection that you are targeting for deposit. (10)
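
For the third step, a client-side call could be prepared roughly as sketched below. The default `path_template` is an assumption that this thread does not confirm, so check it against the API guide of your installation; `schema_request` is a hypothetical helper and `dataverse.example.org` is a placeholder host. The sketch only builds the request and does not send it.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request


def schema_request(base_url, collection_alias, api_token=None,
                   path_template="api/dataverses/{alias}/datasetSchema"):
    """Prepare a GET request for the JSON Schema of one collection.

    path_template is an assumed endpoint path; the collection is passed as
    the {alias} parameter, matching the "collection you are targeting" idea.
    """
    path = path_template.format(alias=quote(collection_alias, safe=""))
    url = urljoin(base_url.rstrip("/") + "/", path)
    headers = {"Accept": "application/json"}
    if api_token:
        # Header name used by the Dataverse API for token authentication.
        headers["X-Dataverse-key"] = api_token
    return Request(url, headers=headers, method="GET")


req = schema_request("https://dataverse.example.org", "root")
print(req.get_method(), req.full_url)
# GET https://dataverse.example.org/api/dataverses/root/datasetSchema
```

Passing the result to `urllib.request.urlopen` and parsing the body with `json.load` would then yield the schema to validate against before depositing.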

reference:

Next Steps:

  • Set this issue up as a deliverable (mike)
  • Create the 3 issues. ( @poikilotherm )

@mreekie mreekie changed the title from "As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API" to "Bklog: Deliverable: As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API" Mar 15, 2023
@mreekie mreekie changed the title from "Bklog: Deliverable: As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API" to "bklog: Deliverable - As a system integrator, I would appreciate a JSON Schema for validating my dataset JSON before uploading via API" Mar 15, 2023
@mreekie
Collaborator

mreekie commented Mar 15, 2023

Next Steps:

Set this issue up as a deliverable (mike)

@poikilotherm I've set this up as a backlog deliverable.

Note - As a backlog deliverable:

  • It resides in dataverse-pm (the dev team found it confusing for these to stay in the dataverse repo)
  • The title is modified
  • The description is modified
  • Added labels: "D: ImproveJsonValidation" & "bklog: Deliverable"

Once we don't need this anymore we just move it in the backlog to "clear of the backlog"
(Sorry about the loss of the original labels. )


When you create the three issues:

  • Add the label: "D: ImproveJsonValidation" to each
  • Add the label: Size: 10 to each
  • Add them directly to the backlog to the column "SPRINT READY"

At that point they will be queued for an upcoming sprint.

Thank you so much for stepping up during the meeting to create the issues!

@mreekie mreekie transferred this issue from IQSS/dataverse Mar 15, 2023
@mreekie mreekie added D: ImproveJsonValidation https://github.com/IQSS/dataverse-pm/issues/26 bklog: Deliverable labels Mar 15, 2023
@mreekie
Collaborator

mreekie commented Mar 22, 2023

Sizing:

  • Oliver mentioned that when he is able to get to the issue creation he'll drop a note here.

@mreekie
Collaborator

mreekie commented Mar 22, 2023

Grooming:

  • Wahoo! We're on our way.

Next Steps:

  • Until the 3 spawned issues get worked, we can leave this hanging around in "Sprint Needs Sizing"
  • Once we decide that our objective for this has been met, we just change the state to "Clear of Backlog" in the backlog project.

@mreekie
Collaborator

mreekie commented Mar 27, 2023

Prio:

  • This issue is going to be closed based on the creation of the three issues.

@mreekie mreekie closed this as completed Mar 27, 2023
@mreekie mreekie moved this from SPRINT- NEEDS SIZING to Clear of the Backlog in IQSS Dataverse Project Mar 27, 2023
@pdurbin
Member

pdurbin commented Dec 6, 2023

For anyone watching this issue, yesterday we merged this PR:

@pdurbin
Member

pdurbin commented Jul 2, 2024

8 participants