Merge pull request IQSS#10533 from IQSS/10341-croissant
make Croissant support official, update exporter docs
scolapasta authored Jul 31, 2024
2 parents c39ac88 + b5e7401 commit 28665c8
Showing 8 changed files with 115 additions and 32 deletions.
9 changes: 9 additions & 0 deletions doc/release-notes/10341-croissant.md
@@ -0,0 +1,9 @@
A new metadata export format called Croissant is now available as an external metadata exporter. It is oriented toward making datasets consumable by machine learning.

When enabled, Croissant replaces the Schema.org JSON-LD format in the `<head>` of dataset landing pages. For details, see [Schema.org JSON-LD/Croissant Metadata](https://dataverse-guide--10533.org.readthedocs.build/en/10533/admin/discoverability.html#schema-org-head) in the discoverability section of the Admin Guide.

For more about the Croissant exporter, see https://github.com/gdcc/exporter-croissant

For installation instructions, see [Enabling External Exporters](https://dataverse-guide--10533.org.readthedocs.build/en/10533/installation/advanced.html#enabling-external-exporters) in the Installation Guide.

See also Issue #10341 and PR #10533.
13 changes: 10 additions & 3 deletions doc/sphinx-guides/source/admin/discoverability.rst
@@ -30,14 +30,21 @@ The HTML source of a dataset landing page includes "DC" (Dublin Core) ``<meta>``
    <meta name="DC.type" content="Dataset">
    <meta name="DC.title" content="...">

Schema.org JSON-LD Metadata
+++++++++++++++++++++++++++
.. _schema.org-head:

The HTML source of a dataset landing page includes Schema.org JSON-LD metadata like this::
Schema.org JSON-LD/Croissant Metadata
+++++++++++++++++++++++++++++++++++++

The ``<head>`` of the HTML source of a dataset landing page includes Schema.org JSON-LD metadata like this::


    <script type="application/ld+json">{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/...

If you enable the Croissant metadata export format (see :ref:`external-exporters`), the ``<head>`` will show Croissant metadata instead. It looks similar, but you should see ``"cr": "http://mlcommons.org/croissant/"`` in the output.

For backward compatibility, if you enable Croissant, the older Schema.org JSON-LD format (``schema.org`` in the API) will still be available from both the web interface (see :ref:`metadata-export-formats`) and the API (see :ref:`export-dataset-metadata-api`).

The Dataverse team has been working with Google on both formats. Google has `indicated <https://github.com/mlcommons/croissant/issues/530#issuecomment-1964227662>`_ that for `Google Dataset Search <https://datasetsearch.research.google.com>`_ (the main reason we started adding this extra metadata in the ``<head>`` of dataset pages), Croissant is the successor to the older format.
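
To check which of the two formats a given landing page is serving, something like the following may help. It is a minimal sketch using only the Python standard library; the function and class names are illustrative, and it assumes that Croissant output declares ``"cr": "http://mlcommons.org/croissant/"`` inside an object-valued ``@context`` while the older format uses a plain ``schema.org`` string, as described above.

```python
import json
from html.parser import HTMLParser

CROISSANT_NS = "http://mlcommons.org/croissant/"


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the open block.
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def head_metadata_format(html):
    """Return "croissant", "schema.org", or None for a landing page's HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        doc = json.loads(block)
        if not isinstance(doc, dict):
            continue
        context = doc.get("@context")
        # Assumption: Croissant declares the "cr" prefix in an object-valued @context.
        if isinstance(context, dict) and context.get("cr") == CROISSANT_NS:
            return "croissant"
        if isinstance(context, str) and "schema.org" in context:
            return "schema.org"
    return None
```

Feed it the HTML of a dataset landing page (fetched however you prefer); a result of ``None`` means no recognizable JSON-LD block was found.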

.. _discovery-sign-posting:

7 changes: 5 additions & 2 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -1218,16 +1218,19 @@ The fully expanded example above (without environment variables) looks like this
.. note:: Supported exporters (export formats) are ``ddi``, ``oai_ddi``, ``dcterms``, ``oai_dc``, ``schema.org``, ``OAI_ORE``, ``Datacite``, ``oai_datacite`` and ``dataverse_json``. Descriptive names can be found under :ref:`metadata-export-formats` in the User Guide.

.. note:: Additional exporters can be enabled, as described under :ref:`external-exporters` in the Installation Guide. To discover the machine-readable name of each exporter (e.g. ``ddi``), check :ref:`inventory-of-external-exporters` or ``getFormatName`` in the exporter's source code.
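
As a rough illustration of how the machine-readable exporter name is used, the sketch below builds an export URL for a dataset. The helper name is hypothetical, and the server and DOI in the usage line are placeholders; it assumes the ``/api/datasets/export`` endpoint with ``exporter`` and ``persistentId`` query parameters shown in the curl examples earlier in this section.

```python
from urllib.parse import urlencode


def export_url(server, exporter, persistent_id):
    """Build a metadata export URL for one dataset (hypothetical helper)."""
    # urlencode percent-encodes the ":" and "/" characters in the DOI.
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{server.rstrip('/')}/api/datasets/export?{query}"
```

For example, ``export_url("https://demo.dataverse.org", "schema.org", "doi:10.5072/FK2/J8SJZB")`` produces a URL that can be fetched with any HTTP client. For an external exporter, substitute the value its ``getFormatName`` returns.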

Schema.org JSON-LD
^^^^^^^^^^^^^^^^^^

Please note that the ``schema.org`` format has changed in backwards-incompatible ways after Dataverse Software version 4.9.4:
Please note that the ``schema.org`` format has changed in backwards-incompatible ways after Dataverse 4.9.4:

- "description" was a single string and now it is an array of strings.
- "citation" was an array of strings and now it is an array of objects.

Both forms are valid according to Google's Structured Data Testing Tool at https://search.google.com/structured-data/testing-tool . (This tool will report "The property affiliation is not recognized by Google for an object of type Thing" and this known issue is being tracked at https://github.com/IQSS/dataverse/issues/5029 .) Schema.org JSON-LD is an evolving standard that permits a great deal of flexibility. For example, https://schema.org/docs/gs.html#schemaorg_expected indicates that even when objects are expected, it's ok to just use text. As with all metadata export formats, we will try to keep the Schema.org JSON-LD format your Dataverse installation emits backward-compatible to made integrations more stable, despite the flexibility that's afforded by the standard.
Both forms are valid according to Google's Structured Data Testing Tool at https://search.google.com/structured-data/testing-tool . Schema.org JSON-LD is an evolving standard that permits a great deal of flexibility. For example, https://schema.org/docs/gs.html#schemaorg_expected indicates that even when objects are expected, it's ok to just use text. As with all metadata export formats, we will try to keep the Schema.org JSON-LD format backward-compatible to make integrations more stable, despite the flexibility that's afforded by the standard.

The standard has further evolved into a format called Croissant. For details, see :ref:`schema.org-head` in the Admin Guide.
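
Integrations that must read exports from installations both before and after the 4.9.4 change could normalize the two fields up front. This is only a sketch: the function names are illustrative, and the ``name`` key used to wrap older string citations is an assumption made for the example, not something the export format guarantees.

```python
def normalize_description(value):
    """Return "description" as a list of strings, whichever form was emitted."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # 4.9.4 and earlier: a single string
    return list(value)  # later versions: an array of strings


def normalize_citation(value):
    """Return "citation" as a list of dicts, whichever form was emitted."""
    normalized = []
    for entry in value or []:
        if isinstance(entry, str):
            # Older form: an array of strings. The "name" key is a
            # hypothetical choice for this sketch.
            normalized.append({"name": entry})
        else:
            normalized.append(dict(entry))  # newer form: an array of objects
    return normalized
```

Downstream code can then treat both fields as lists without branching on the Dataverse version.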

List Files in a Dataset
~~~~~~~~~~~~~~~~~~~~~~~
55 changes: 54 additions & 1 deletion doc/sphinx-guides/source/developers/making-library-releases.rst
@@ -69,6 +69,8 @@ These examples from the SWORD library. Below is what to expect from the interact
What is the new development version for "SWORD v2 Common Server Library (forked)"? (sword2-server) 2.0.1-SNAPSHOT: :
[INFO] 8/17 prepare:rewrite-poms-for-release
Note that a commit or two will be made and pushed but if you do a ``git status`` you will see that locally you are behind by that number of commits. To fix this, you can just do a ``git pull``.

It can take some time for the jar to be visible on Maven Central. You can start by looking on the repo1 server, like this: https://repo1.maven.org/maven2/io/gdcc/sword2-server/2.0.0/

Don't bother putting the new version in a pom.xml until you see it on repo1.
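
The repo1 URL follows the standard Maven repository layout, in which the dots in the groupId become path separators. A small sketch of that mapping (the helper name is illustrative):

```python
def repo1_url(group_id, artifact_id, version):
    """Return the repo1 directory URL for a released artifact version."""
    base = "https://repo1.maven.org/maven2"
    # Standard Maven layout: dots in the groupId map to directory separators.
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/"
```

Calling ``repo1_url("io.gdcc", "sword2-server", "2.0.0")`` yields the SWORD URL shown above.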
@@ -80,14 +82,65 @@ Releasing a New Library to Maven Central

At a high level:

- Start with a snapshot release.
- Use an existing pom.xml as a starting point.
- Use existing GitHub Actions workflows as a starting point.
- Create secrets in the new library's GitHub repo used by the workflow.
- If you need an entire new namespace, look at previous issues such as https://issues.sonatype.org/browse/OSSRH-94575 and https://issues.sonatype.org/browse/OSSRH-94577

Updating pom.xml for a Snapshot Release
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before publishing a final version to Maven Central, you should publish a snapshot release or two. Each snapshot release you publish gets a unique jar name (e.g. ``foobar-0.0.1-20240430.175110-3.jar``), so you can safely publish over and over with the same version number.
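
To make the naming scheme concrete, the sketch below splits a snapshot jar name into its parts. It assumes the usual Maven pattern of artifact, version, timestamp, and build number, and a purely numeric dotted version; the function name is illustrative.

```python
import re

# Assumed pattern: <artifact>-<version>-<yyyyMMdd.HHmmss>-<build>.jar
SNAPSHOT_JAR = re.compile(
    r"^(?P<artifact>.+)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<timestamp>\d{8}\.\d{6})-(?P<build>\d+)\.jar$"
)


def parse_snapshot_jar(filename):
    """Return the parts of a snapshot jar name, or None if it does not match."""
    match = SNAPSHOT_JAR.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["build"] = int(parts["build"])
    return parts
```

The timestamp and build number are what change between publishes, which is why the version number itself can stay the same.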

We use the `Nexus Staging Maven Plugin <https://github.com/sonatype/nexus-maven-plugins/blob/main/staging/maven-plugin/README.md>`_ to push snapshot releases to https://s01.oss.sonatype.org/content/groups/staging/io/gdcc/ and https://s01.oss.sonatype.org/content/groups/staging/org/dataverse/

Add the following to your pom.xml:

.. code-block:: xml

    <!-- Project version: the -SNAPSHOT suffix marks a snapshot release -->
    <version>0.0.1-SNAPSHOT</version>

    <!-- Where snapshots and releases are deployed -->
    <distributionManagement>
        <snapshotRepository>
            <id>ossrh</id>
            <url>https://s01.oss.sonatype.org/content/repositories/snapshots</url>
        </snapshotRepository>
        <repository>
            <id>ossrh</id>
            <url>https://s01.oss.sonatype.org/service/local/staging/deploy/maven2/</url>
        </repository>
    </distributionManagement>

    <!-- Goes in the build plugins section -->
    <plugin>
        <groupId>org.sonatype.plugins</groupId>
        <artifactId>nexus-staging-maven-plugin</artifactId>
        <version>${nexus-staging.version}</version>
        <extensions>true</extensions>
        <configuration>
            <serverId>ossrh</serverId>
            <nexusUrl>https://s01.oss.sonatype.org</nexusUrl>
            <autoReleaseAfterClose>true</autoReleaseAfterClose>
        </configuration>
    </plugin>

Configuring Secrets
~~~~~~~~~~~~~~~~~~~

In GitHub, you will likely need to configure the following secrets:

- DATAVERSEBOT_GPG_KEY
- DATAVERSEBOT_GPG_PASSWORD
- DATAVERSEBOT_SONATYPE_TOKEN
- DATAVERSEBOT_SONATYPE_USERNAME

Note that some of these secrets might be configured at the org level (e.g. gdcc or IQSS).

Many of the automated tasks are performed by the dataversebot account on GitHub: https://github.com/dataversebot

npm (JavaScript/TypeScript)
---------------------------

Currently, publishing `@iqss/dataverse-design-system <https://www.npmjs.com/package/@iqss/dataverse-design-system>`_ to npm is done manually. We plan to automate this as part of https://github.com/IQSS/dataverse-frontend/issues/140

https://www.npmjs.com/package/js-dataverse is the previous 1.0 version of js-dataverse. No 1.x releases are planned. We plan to publish 2.0 (used by the new frontend) as discussed in https://github.com/IQSS/dataverse-frontend/issues/13
9 changes: 6 additions & 3 deletions doc/sphinx-guides/source/developers/metadataexport.rst
@@ -15,8 +15,11 @@ Dataverse instances.
As of v5.14, Dataverse provides a mechanism for third-party developers to create new metadata Exporters that implement
new metadata formats or that replace existing formats. All the necessary dependencies are packaged in an interface JAR file
available from Maven Central. Developers can distribute their new Exporters as JAR files which can be dynamically loaded
into Dataverse instances - see :ref:`external-exporters`. Developers are encouraged to make their Exporter code available
via https://github.com/gdcc/dataverse-exporters (or minimally, to list their existence in the README there).
into Dataverse instances - see :ref:`external-exporters`. Developers are encouraged to work with the core Dataverse team
(see :ref:`getting-help-developers`) to distribute these JAR files via Maven Central. See the
`Croissant <https://central.sonatype.com/artifact/io.gdcc.export/croissant>`_ and
`Debug <https://central.sonatype.com/artifact/io.gdcc.export/debug>`_ artifacts as examples. You may find other examples
under :ref:`inventory-of-external-exporters` in the Installation Guide.

Exporter Basics
---------------
@@ -63,7 +66,7 @@ If an Exporter cannot create a requested metadata format for some reason, it sho
Building an Exporter
--------------------

The example at https://github.com/gdcc/dataverse-exporters provides a Maven pom.xml file suitable for building an Exporter JAR file and that repository provides additional development guidance.
The examples at https://github.com/gdcc/exporter-croissant and https://github.com/gdcc/exporter-debug provide a Maven pom.xml file suitable for building an Exporter JAR file and those repositories provide additional development guidance.

There are four dependencies needed to build an Exporter:

35 changes: 17 additions & 18 deletions doc/sphinx-guides/source/installation/advanced.rst
@@ -119,27 +119,26 @@ To activate in your Dataverse installation::

.. _external-exporters:

Installing External Metadata Exporters
++++++++++++++++++++++++++++++++++++++
External Metadata Exporters
+++++++++++++++++++++++++++

As of Dataverse Software 5.14 Dataverse supports the use of external Exporters as a way to add additional metadata
export formats to Dataverse or replace the built-in formats. This should be considered an **experimental** capability
in that the mechanism is expected to evolve and using it may require additional effort when upgrading to new Dataverse
versions.
Dataverse 5.14+ supports the configuration of external metadata exporters (just "external exporters" or "exporters" for short) as a way to add additional metadata export formats or replace built-in formats. For a list of built-in formats, see :ref:`metadata-export-formats` in the User Guide.

This capability is enabled by specifying a directory in which Dataverse should look for third-party Exporters. See
:ref:`dataverse.spi.exporters.directory`.
This should be considered an **experimental** capability in that the mechanism is expected to evolve and using it may require additional effort when upgrading to new Dataverse versions.

See :doc:`/developers/metadataexport` for details about how to develop new Exporters.
Enabling External Exporters
^^^^^^^^^^^^^^^^^^^^^^^^^^^

An minimal example Exporter is available at https://github.com/gdcc/dataverse-exporters. The community is encourage to
add additional exporters (and/or links to exporters elsewhere) in this repository. Once you have downloaded the
dataverse-spi-export-examples-1.0.0.jar (or other exporter jar), installed it in the directory specified above, and
restarted your Payara server, the new exporter should be available.
Use the :ref:`dataverse.spi.exporters.directory` configuration option to specify a directory from which external exporters (JAR files) should be loaded.

The example dataverse-spi-export-examples-1.0.0.jar replaces the ``JSON`` export with a ``MyJSON in <locale>`` version
that just wraps the existing JSON export object in a new JSON object with the key ``inputJson`` containing the original
JSON.(Note that the ``MyJSON in <locale>`` label will appear in the dataset Metadata Export download menu immediately,
but the content for already published datasets will only be updated after you delete the cached exports and/or use a
reExport API call (see :ref:`batch-exports-through-the-api`).)
.. _inventory-of-external-exporters:

Inventory of External Exporters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a list of external exporters, see the README at https://github.com/gdcc/dataverse-exporters

Developing New Exporters
^^^^^^^^^^^^^^^^^^^^^^^^

See :doc:`/developers/metadataexport` for details about how to develop new exporters.
15 changes: 11 additions & 4 deletions doc/sphinx-guides/source/installation/config.rst
@@ -3190,12 +3190,19 @@ Can also be set via any `supported MicroProfile Config API source`_, e.g. the en
dataverse.spi.exporters.directory
+++++++++++++++++++++++++++++++++

This JVM option is used to configure the file system path where external Exporter JARs can be placed. See :ref:`external-exporters` for more information.
For some background, see :ref:`external-exporters` and :ref:`inventory-of-external-exporters`.

``./asadmin create-jvm-options '-Ddataverse.spi.exporters.directory=PATH_LOCATION_HERE'``
This JVM option is used to configure the file system path where external exporter JARs should be loaded from. For example:

If this value is set, Dataverse will examine all JARs in the specified directory and will use them to add, or replace existing, metadata export formats.
If this value is not set (the default), Dataverse will not use external Exporters.
``./asadmin create-jvm-options '-Ddataverse.spi.exporters.directory=/var/lib/dataverse/exporters'``

If this value is set, Dataverse will examine all JARs in the specified directory and will use them to add new metadata export formats or (if the machine-readable name used in :ref:`export-dataset-metadata-api` is the same) replace built-in metadata export formats.

If this value is not set (the default), Dataverse will not load any external exporters.

If you place a new JAR in this directory, you must restart Payara for Dataverse to load it.

If the JAR is for an exporter that replaces a built-in format, you must delete the cached exports and/or use a reExport API call (see :ref:`batch-exports-through-the-api`) for the new format to be visible for existing datasets.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SPI_EXPORTERS_DIRECTORY``.
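
The environment variable name follows the MicroProfile Config mapping rule: characters that are not alphanumeric become underscores and the result is upper-cased. A small sketch of that rule (the function name is illustrative):

```python
import re


def microprofile_env_name(property_name):
    """Map a config property name to its upper-case environment variable form."""
    # Replace every non-alphanumeric character with "_", then upper-case.
    return re.sub(r"[^A-Za-z0-9]", "_", property_name).upper()
```

Applied to ``dataverse.spi.exporters.directory``, this returns ``DATAVERSE_SPI_EXPORTERS_DIRECTORY``.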

4 changes: 3 additions & 1 deletion doc/sphinx-guides/source/user/dataset-management.rst
@@ -25,7 +25,7 @@ For more details about what Citation and Domain Specific Metadata is supported p
Supported Metadata Export Formats
---------------------------------

Once a dataset has been published, its metadata can be exported in a variety of other metadata standards and formats, which help make datasets more discoverable and usable in other systems, such as other data repositories. On each dataset page's metadata tab, the following exports are available:
Once a dataset has been published, its metadata can be exported in a variety of other metadata standards and formats, which help make datasets more :doc:`discoverable </admin/discoverability>` and usable in other systems, such as other data repositories. On each dataset page's metadata tab, the following exports are available:

- Dublin Core
- DDI (Data Documentation Initiative Codebook 2.5)
@@ -36,6 +36,8 @@ Once a dataset has been published, its metadata can be exported in a variety of
- OpenAIRE
- Schema.org JSON-LD

Additional formats can be enabled. See :ref:`inventory-of-external-exporters` in the Installation Guide.

Each of these metadata exports contains the metadata of the most recently published version of the dataset.

.. _adding-new-dataset: