how to create a ODIS node for (Harvard) Dataverse and searching for all ocean data sets in it via ODIS? #481

Open
gaelforget opened this issue Nov 14, 2024 · 9 comments

Comments

@gaelforget

A prototypical application would be to search Dataverse through ODIS for sizable, regularly formatted data sets covering a given ocean region (e.g. the coastal ocean off New England, US).

Below, I document the bits and pieces we looked at today while discussing this idea with @pbuttigieg.

@gaelforget
Author

ping @pdurbin , @atrisovic

@gaelforget changed the title from "help needed to create a ODIS node for Harvard data verse and searching for all ocean data sets in it via ODIS" to "how to create a ODIS node for (Harvard) Dataverse and searching for all ocean data sets in it via ODIS?" on Nov 14, 2024
@pbuttigieg
Collaborator

The JSON for an example record (Ocean Heat Content, https://doi.org/10.7910/DVN/CAGYQL):

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:regex",
    "repeated": "cr:repeated",
    "replace": "cr:replace",
    "sc": "https://schema.org/",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform",
    "wd": "https://www.wikidata.org/wiki/"
  },
  "@type": "sc:Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "Ocean Heat Content",
  "url": "https://doi.org/10.7910/DVN/CAGYQL",
  "creator": [
    {
      "@type": "Person",
      "givenName": "Gael",
      "familyName": "Forget",
      "affiliation": {
        "@type": "Organization",
        "name": "Massachusetts Institute of Technology"
      },
      "name": "Forget, Gael"
    }
  ],
  "description": "Estimates (OCCA2, ECCO4) of global ocean heat content (OHC) anomaly from 2004-2006 climatology. ECCO4 is a closed heat budget estimate. ECCO4 release 5 is used here that covers 1992-2019. OCCA2 was derived by 1. extending ECCO4 (r2) to 1980-2022 and 2. adding a gridded adjustment to Argo over 2004-2022. The 2004-2006 climatologies were subtracted separately before combining anomalies over 1992-2019.",
  "keywords": [
    "Earth and Environmental Sciences",
    "ocean",
    "climate",
    "warming"
  ],
  "license": "http://creativecommons.org/publicdomain/zero/1.0",
  "datePublished": "2024-03-07",
  "dateModified": "2024-03-08",
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "Harvard Dataverse",
    "url": "https://dataverse.harvard.edu"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Harvard Dataverse"
  },
  "version": "1.1",
  "citeAs": "@data{DVN/CAGYQL_2024,author = {Forget, Gael},publisher = {Harvard Dataverse},title = {Ocean Heat Content},year = {2024},url = {https://doi.org/10.7910/DVN/CAGYQL}}",
  "citation": [
    {
      "@type": "CreativeWork",
      "name": "Forget, G.: Energy Imbalance in the Sunlit Ocean Layer (submitted)"
    }
  ],
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]
}

@pbuttigieg
Collaborator

pbuttigieg commented Nov 14, 2024

The Croissant semantics break interoperability at the moment, without much gain. But most of the metadata is immediately useful.

@pbuttigieg
Collaborator

@gaelforget I'll generate some suggestions for improved metadata based on the example above.

In the meantime, setting up the Node (even with the metadata in its current form) can begin by following https://book.odis.org/gettingStarted.html

I'd set up a dedicated sitemap for ocean-related content (of any kind: socio-economic, physical, biological, ...) and use its URL as the value of your ODIS-Arch URL in the ODISCat entry.
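
A dedicated sitemap of this kind could be generated with a few lines of Python. This is only a sketch under stated assumptions: the function name `build_sitemap` is hypothetical, the list of ocean-related records is hard-coded here rather than queried from Dataverse, and whether the sitemap should list DOI URLs or landing-page URLs is left open. The namespace is the one defined by the sitemaps.org protocol, and the date is the `dateModified` value from the example record.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """Return sitemap XML for (url, last-modified date) pairs.

    Hypothetical helper: the caller decides which records count as
    ocean-related and which URL form to publish.
    """
    ET.register_namespace("", SITEMAP_NS)  # serialize without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Example input: the single record discussed in this thread.
ocean_records = [("https://doi.org/10.7910/DVN/CAGYQL", "2024-03-08")]
sitemap_xml = build_sitemap(ocean_records)
print(sitemap_xml)
```

The resulting file would be published at a stable URL, and that URL is what would go into the ODISCat entry.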

@pbuttigieg
Collaborator

@fils this is an opportunity to figure out how to handle Croissant semantics and types in a smart way. I'm thinking of using additionalType for non-schema.org types. That would also allow Croissant properties in the stanzas.

@pdurbin

pdurbin commented Nov 14, 2024

@gaelforget hi! @atrisovic and I are at a conference but my first recommendation is to

Also, you're welcome to kick off a thread in our Zulip! https://dataverse.zulipchat.com

@pbuttigieg
Collaborator

pbuttigieg commented Nov 18, 2024

I'll post a comment for each component that is currently preventing compatibility with existing schema.org systems. We'll start with the distribution property and its value space, which currently throws a validation error:

[Image: screenshot of the validation error]

Distribution

Status quo

"distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "cr:FileObject",
      "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]

Proposed change

  • Restores the expected DataDownload type in the distribution value space, so the validator no longer complains.
  • Retains Croissant metadata and typing using additionalType. As an alternative, you can include the Croissant type in an array alongside DataDownload (see below).
  • Removes @ids that don't resolve to a JSON node.

Additional changes that may be useful:

  • Add a unit (KB, MB) to contentSize; the schema.org definition is ambiguous: "File size in (mega/kilo)bytes."
  • If reference node @ids are to be included, ensure they point to either a JSON-LD file or something that delivers one (e.g. an embed in HTML).

"distribution": [
    {
      "@type": "DataDownload",
      "additionalType": "cr:FileObject",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    },
    {
      "@type": "DataDownload",
      "additionalType": "cr:FileObject",
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.png",
      "encodingFormat": "image/png",
      "md5": "81dbe65ed124c315ab7db4b0bf680186",
      "contentSize": "39385",
      "description": "Visualization of global OHC anomaly, computed from 2004-2006 climatology, for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954363"
    }
  ]
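
The proposed change could also be applied programmatically when exporting many records. The following is a minimal Python sketch, assuming each distribution entry has already been loaded as a dict. The function name `to_data_download` and its `use_type_array` flag are hypothetical, and dropping every `@id` unconditionally is a simplification: the proposal only removes @ids that don't resolve to a JSON node.

```python
def to_data_download(entry: dict, use_type_array: bool = False) -> dict:
    """Return a copy of a distribution entry typed as schema.org DataDownload.

    Hypothetical helper. The original (e.g. Croissant) type is kept either
    in additionalType or, if use_type_array is set, in an @type array.
    """
    declared = entry.get("@type", [])
    declared = declared if isinstance(declared, list) else [declared]
    extra = [t for t in declared if t != "DataDownload"]

    # Simplification: every @id is dropped, resolvable or not.
    rest = {k: v for k, v in entry.items() if k not in ("@type", "@id")}

    if use_type_array:
        return {"@type": ["DataDownload", *extra], **rest}
    typed = {"@type": "DataDownload"}
    if extra:
        typed["additionalType"] = extra[0] if len(extra) == 1 else extra
    return {**typed, **rest}


original = {
    "@type": "cr:FileObject",
    "@id": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
    "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
    "encodingFormat": "application/x-netcdf",
    "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362",
}
with_additional_type = to_data_download(original)
with_type_array = to_data_download(original, use_type_array=True)
```

Both variants leave the remaining properties, including the Croissant `md5` term, untouched.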

The alternative using an array for types:

"distribution": [
    {
      "@type": ["DataDownload", "cr:FileObject"],
      "name": "OCCA2_ECCO4_global_OHC_anomaly_1992_2019.nc",
      "encodingFormat": "application/x-netcdf",
      "md5": "6578a2fa4f30bdb277b8b4581de9bb6b",
      "contentSize": "14705",
      "description": "Global ocean heat anomalies, in ZJoule, computed from 2004-2006 climatology for OCCA2 (release 1) and ECCO4 (release 5)",
      "contentUrl": "https://dataverse.harvard.edu/api/access/datafile/8954362"
    }
  ]

Verify validation

[Image: screenshot of the validation result]
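
Before pasting a record into a validator, a quick local pre-check can flag distribution entries that still lack the DataDownload type. This is only a sketch and not a substitute for a real schema.org validator: the function names are hypothetical, and the set of accepted spellings of the type (bare term, `sc:` prefix, full IRI) is an assumption based on the context shown earlier in this thread.

```python
# Spellings of the schema.org type accepted by this sketch (an assumption).
SCHEMA_DOWNLOAD = {
    "DataDownload",
    "sc:DataDownload",
    "https://schema.org/DataDownload",
}


def types_of(node: dict) -> list:
    """Return the @type of a node as a list, whether it is a string or an array."""
    declared = node.get("@type", [])
    return declared if isinstance(declared, list) else [declared]


def untyped_downloads(record: dict) -> list:
    """Return the indexes of distribution entries lacking a DataDownload type."""
    distribution = record.get("distribution", [])
    if isinstance(distribution, dict):
        distribution = [distribution]
    return [
        index
        for index, entry in enumerate(distribution)
        if not SCHEMA_DOWNLOAD & set(types_of(entry))
    ]


status_quo = {"distribution": [{"@type": "cr:FileObject"}]}
proposed = {
    "distribution": [
        {"@type": "DataDownload", "additionalType": "cr:FileObject"},
        {"@type": ["DataDownload", "cr:FileObject"]},
    ]
}
print(untyped_downloads(status_quo))  # indexes that still need fixing
print(untyped_downloads(proposed))
```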

@pbuttigieg
Collaborator

pbuttigieg commented Nov 18, 2024

@gaelforget

The change to distribution described above fixes the validation errors and - in principle - should make the record fine for discoverability in ODIS. What we'll then need is a sitemap pointing to all the records you wish to share over ODIS, and your registration in OceanExpert and ODISCat. All described here.
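
To make the role of the sitemap concrete, here is a rough Python sketch of what a harvester might do with each page the sitemap lists: pull out any embedded JSON-LD. It is not the actual ODIS harvesting code. The class name `JsonLdExtractor` is hypothetical, the HTML is an inline stand-in rather than a fetched page, and it assumes each listed page embeds its record in a `<script type="application/ld+json">` element.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: list[str] = []
        self.documents: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


# Stand-in for a landing page listed in the sitemap (not a real response).
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Ocean Heat Content"}
</script>
</head><body></body></html>"""

parser = JsonLdExtractor()
parser.feed(page)
for document in parser.documents:
    print(document["@type"], document["name"])
```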

That being said, it seems Croissant semantics are introducing some "noise" in addition to their very useful extensions of the base schema.org context. As mentioned, we'll likely write some guidance on how to best merge the two, without duplication / reinvention of things that vanilla schema.org already does.

@pdurbin

pdurbin commented Nov 18, 2024

We'll start with the distribution property and its value space, which currently throws a validation error

Seeing "http://mlcommons.org/croissant/FileObject is not a known valid target type for the distribution property" as a validation error is a known issue. Please see this issue:
