
Commit 7a29522

Merge pull request IQSS#11217 from Recherche-Data-Gouv/10217-source-name-harvesting-client
10217 source name harvesting client
2 parents ab8110f + f7c4c42 commit 7a29522

15 files changed

+198
-125
lines changed

@@ -0,0 +1,13 @@
+### Metadata Source Facet Can Now Differentiate Between Harvested Sources
+
+The behavior of the feature flag `index-harvested-metadata-source` and the "Metadata Source" facet, which were added and updated, respectively, in [Dataverse 6.3](https://github.com/IQSS/dataverse/releases/tag/v6.3) (through pull requests #10464 and #10651), has changed. A new field called "Source Name" has been added to harvesting clients.
+
+Before Dataverse 6.3, all harvested content (datasets and files) appeared together under "Harvested" in the "Metadata Source" facet. This is still the out-of-the-box behavior of Dataverse. Since Dataverse 6.3, enabling the `index-harvested-metadata-source` feature flag (and reindexing) resulted in harvested content appearing under the nickname of whatever harvesting client was used to bring in the content. This meant that instead of having all harvested content lumped together under "Harvested", content would appear under "client1", "client2", etc.
+
+Now, as of this release, enabling the `index-harvested-metadata-source` feature flag, populating a new field for harvesting clients called "Source Name" ("sourceName" in the [API](https://dataverse-guide--11217.org.readthedocs.build/en/11217/api/native-api.html#create-a-harvesting-client)), and reindexing (see upgrade instructions below) results in the source name appearing under the "Metadata Source" facet rather than the harvesting client nickname. This gives you more control over the name that appears under the "Metadata Source" facet and allows you to group harvested content from various harvesting clients under the same name if you wish (by reusing the same source name).
+
+Previously, `index-harvested-metadata-source` was not documented in the guides, but you can now find information about it under [Feature Flags](https://dataverse-guide--11217.org.readthedocs.build/en/11217/installation/config.html#feature-flags). See also #10217 and #11217.
+
+## Upgrade instructions
+
+If you have enabled the `dataverse.feature.index-harvested-metadata-source` feature flag and given some of your harvesting clients a source name, you should reindex to have those source names appear under the "Metadata Source" facet.
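As a sketch of the reindex step, the call can also be scripted with Python's standard library. This assumes a local installation on port 8080 and the admin API's full-reindex endpoint (`/api/admin/index`); check the Admin Guide's Solr reindexing documentation for the authoritative procedure:

```python
import urllib.request

# Sketch: compose the full-reindex call used after enabling the
# index-harvested-metadata-source flag and setting source names.
# Assumes a local installation; the admin API is typically blocked
# for non-localhost traffic.
SERVER_URL = "http://localhost:8080"  # assumption: local installation

reindex_request = urllib.request.Request(f"{SERVER_URL}/api/admin/index")
# urllib.request.urlopen(reindex_request) would kick off the reindex;
# it is not executed here because it needs a running server.
```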
@@ -0,0 +1,11 @@
+{
+  "nickName": "zenodo",
+  "dataverseAlias": "zenodoHarvested",
+  "harvestUrl": "https://zenodo.org/oai2d",
+  "archiveUrl": "https://zenodo.org",
+  "archiveDescription": "Harvested from the LMOPS collection of the Zenodo repository. By clicking on this dataset, you will be redirected to Zenodo.",
+  "metadataFormat": "oai_dc",
+  "customHeaders": "x-oai-api-key: xxxyyyzzz",
+  "set": "user-lmops",
+  "allowHarvestingMissingCVV": true
+}
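Since the create API does not validate the supplied values in real time, a local pre-flight check can catch obvious mistakes before POSTing a file like the one above. A minimal sketch in Python: `check_client_config` is a hypothetical helper, not part of Dataverse, and the mandatory-field list here covers only the fields named as mandatory in this diff (`dataverseAlias`, `harvestUrl`):

```python
import json

# Hypothetical helper, not part of Dataverse: sanity-check a
# harvesting-client JSON file locally before POSTing it, since the
# create API accepts whatever values are supplied.
MANDATORY = {"dataverseAlias", "harvestUrl"}  # per the visible doc excerpt

def check_client_config(text: str) -> list[str]:
    """Return a list of problems found in a harvesting-client JSON string."""
    cfg = json.loads(text)
    problems = [f"missing mandatory field: {name}"
                for name in sorted(MANDATORY - cfg.keys())]
    nick = cfg.get("nickName", "")
    # Nicknames are alpha-numeric plus -, _, or %, with no spaces.
    if not nick or not all(c.isalnum() or c in "-_%" for c in nick):
        problems.append("invalid nickName")
    return problems
```

Running it over the example file should report no problems; a nickname containing a space, for instance, would be flagged.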

doc/sphinx-guides/source/api/native-api.rst

+57-33
@@ -5556,7 +5556,7 @@ Create a Harvesting Set
 
 To create a harvesting set you must supply a JSON file that contains the following fields:
 
-- Name: Alpha-numeric may also contain -, _, or %, but no spaces. Must also be unique in the installation.
+- Name: Alpha-numeric may also contain -, _, or %, but no spaces. It must also be unique in the installation.
 - Definition: A search query to select the datasets to be harvested. For example, a query containing authorName:YYY would include all datasets where ‘YYY’ is the authorName.
 - Description: Text that describes the harvesting set. The description appears in the Manage Harvesting Sets dashboard and in API responses. This field is optional.
 
@@ -5652,20 +5652,43 @@ The following API can be used to create and manage "Harvesting Clients". A Harve
 List All Configured Harvesting Clients
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Shows all the Harvesting Clients configured::
+Shows all the harvesting clients configured.
 
-  GET http://$SERVER/api/harvest/clients/
+.. note:: See :ref:`curl-examples-and-environment-variables` if you are unfamiliar with the use of export below.
+
+.. code-block:: bash
+
+  export SERVER_URL=https://demo.dataverse.org
+
+  curl "$SERVER_URL/api/harvest/clients"
+
+The fully expanded example above (without the environment variables) looks like this:
+
+.. code-block:: bash
+
+  curl "https://demo.dataverse.org/api/harvest/clients"
 
 Show a Specific Harvesting Client
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Shows a Harvesting Client with a defined nickname::
+Shows a harvesting client by nickname.
 
-  GET http://$SERVER/api/harvest/clients/$nickname
+.. code-block:: bash
+
+  export SERVER_URL=https://demo.dataverse.org
+  export NICKNAME=myclient
+
+  curl "$SERVER_URL/api/harvest/clients/$NICKNAME"
+
+The fully expanded example above (without the environment variables) looks like this:
 
 .. code-block:: bash
 
-  curl "http://localhost:8080/api/harvest/clients/myclient"
+  curl "https://demo.dataverse.org/api/harvest/clients/myclient"
+
+The output will look something like the following.
+
+.. code-block:: bash
 
   {
     "status":"OK",
@@ -5681,6 +5704,7 @@ Shows a Harvesting Client with a defined nickname::
     "type": "oai",
     "dataverseAlias": "fooData",
     "nickName": "myClient",
+    "sourceName": "",
     "set": "fooSet",
     "useOaiIdentifiersAsPids": false
     "schedule": "none",
@@ -5694,16 +5718,12 @@ Shows a Harvesting Client with a defined nickname::
 }
 
 
+.. _create-a-harvesting-client:
+
 Create a Harvesting Client
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To create a new harvesting client::
-
-  POST http://$SERVER/api/harvest/clients/$nickname
-
-``nickName`` is the name identifying the new client. It should be alpha-numeric and may also contain -, _, or %, but no spaces. Must also be unique in the installation.
 
-You must supply a JSON file that describes the configuration, similarly to the output of the GET API above. The following fields are mandatory:
+To create a harvesting client you must supply a JSON file that describes the configuration, similarly to the output of the GET API above. The following fields are mandatory:
 
 - dataverseAlias: The alias of an existing collection where harvested datasets will be deposited
 - harvestUrl: The URL of the remote OAI archive
@@ -5712,6 +5732,7 @@ You must supply a JSON file that describes the configuration, similarly to the o
 
 The following optional fields are supported:
 
+- sourceName: When ``index-harvested-metadata-source`` is enabled (see :ref:`feature-flags`), sourceName will override the nickname in the Metadata Source facet. It can be used to group the content from many harvesting clients under the same name.
 - archiveDescription: What the name suggests. If not supplied, will default to "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data."
 - set: The OAI set on the remote server. If not supplied, will default to none, i.e., "harvest everything".
 - style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
@@ -5720,38 +5741,35 @@ The following optional fields are supported:
 - useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
 
 Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of writing this there is no way to configure a harvesting schedule via this API.
-
-An example JSON file would look like this::
 
-  {
-    "nickName": "zenodo",
-    "dataverseAlias": "zenodoHarvested",
-    "harvestUrl": "https://zenodo.org/oai2d",
-    "archiveUrl": "https://zenodo.org",
-    "archiveDescription": "Harvested from the LMOPS collection of the Zenodo repository. By clicking on this dataset, you will be redirected to Zenodo.",
-    "metadataFormat": "oai_dc",
-    "customHeaders": "x-oai-api-key: xxxyyyzzz",
-    "set": "user-lmops",
-    "allowHarvestingMissingCVV": true
-  }
+You can download this :download:`harvesting-client.json <../_static/api/harvesting-client.json>` file to use as a starting point.
 
-Something important to keep in mind about this API is that, unlike the harvesting clients GUI, it will create a client with the values supplied without making any attempts to validate them in real time. In other words, for the `harvestUrl` it will accept anything that looks like a well-formed url, without making any OAI calls to verify that the name of the set and/or the metadata format entered are supported by it. This is by design, to give an admin an option to still be able to create a client, in a rare case when it cannot be done via the GUI because of some real time failures in an exchange with an otherwise valid OAI server. This however puts the responsibility on the admin to supply the values already confirmed to be valid.
+.. literalinclude:: ../_static/api/harvesting-client.json
 
+Something important to keep in mind about this API is that, unlike the harvesting clients GUI, it will create a client with the values supplied without making any attempts to validate them in real time. In other words, for the `harvestUrl` it will accept anything that looks like a well-formed url, without making any OAI calls to verify that the name of the set and/or the metadata format entered are supported by it. This is by design, to give an admin an option to still be able to create a client, in a rare case when it cannot be done via the GUI because of some real time failures in an exchange with an otherwise valid OAI server. This however puts the responsibility on the admin to supply the values already confirmed to be valid.
 
 .. note:: See :ref:`curl-examples-and-environment-variables` if you are unfamiliar with the use of export below.
 
+
+``nickName`` in the JSON file and ``$NICKNAME`` in the URL path below is the name identifying the new client. It should be alpha-numeric and may also contain -, _, or %, but no spaces. It must be unique in the installation.
+
 .. code-block:: bash
 
   export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
   export SERVER_URL=http://localhost:8080
+  export NICKNAME=zenodo
 
-  curl -H "X-Dataverse-key:$API_TOKEN" -X POST -H "Content-Type: application/json" "$SERVER_URL/api/harvest/clients/zenodo" --upload-file client.json
+  curl -H "X-Dataverse-key:$API_TOKEN" -X POST -H "Content-Type: application/json" "$SERVER_URL/api/harvest/clients/$NICKNAME" --upload-file harvesting-client.json
 
 The fully expanded example above (without the environment variables) looks like this:
 
 .. code-block:: bash
 
-  curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST -H "Content-Type: application/json" "http://localhost:8080/api/harvest/clients/zenodo" --upload-file "client.json"
+  curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST -H "Content-Type: application/json" "http://localhost:8080/api/harvest/clients/zenodo" --upload-file "harvesting-client.json"
+
+The output will look something like the following.
+
+.. code-block:: bash
 
   {
     "status": "OK",
@@ -5785,15 +5803,21 @@ Similar to the API above, using the same JSON format, but run on an existing cli
 Delete a Harvesting Client
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Self-explanatory:
-
 .. code-block:: bash
 
-  curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X DELETE "http://localhost:8080/api/harvest/clients/$nickName"
+  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+  export SERVER_URL=http://localhost:8080
+  export NICKNAME=zenodo
 
-Only users with superuser permissions may delete harvesting clients.
+  curl -H "X-Dataverse-key:$API_TOKEN" -X DELETE "$SERVER_URL/api/harvest/clients/$NICKNAME"
 
+The fully expanded example above (without the environment variables) looks like this:
+
+.. code-block:: bash
 
+  curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X DELETE "http://localhost:8080/api/harvest/clients/zenodo"
+
+Only users with superuser permissions may delete harvesting clients.
 
 .. _pids-api:
 
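The curl-based create call documented above can equally be composed from a script. A sketch using Python's standard library that mirrors the documented endpoint and headers; the server URL, token, payload values, and the `sourceName` of "Zenodo" are placeholders for illustration:

```python
import json
import urllib.request

# Sketch, not an official client: build the POST request that mirrors
# the curl example in the guide. All values below are placeholders.
SERVER_URL = "http://localhost:8080"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
client = {
    "nickName": "zenodo",
    "dataverseAlias": "zenodoHarvested",
    "harvestUrl": "https://zenodo.org/oai2d",
    "archiveUrl": "https://zenodo.org",
    "metadataFormat": "oai_dc",
    "sourceName": "Zenodo",  # groups this client's content under "Zenodo"
}

req = urllib.request.Request(
    url=f"{SERVER_URL}/api/harvest/clients/{client['nickName']}",
    data=json.dumps(client).encode("utf-8"),
    method="POST",
    headers={
        "X-Dataverse-key": API_TOKEN,
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; omitted here since it
# requires a live server and a superuser API token.
```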

doc/sphinx-guides/source/installation/config.rst

+3
@@ -3493,6 +3493,9 @@ please find all known feature flags below. Any of these flags can be activated u
    * - globus-use-experimental-async-framework
      - Activates a new experimental implementation of Globus polling of ongoing remote data transfers that does not rely on the instance staying up continuously for the duration of the transfers and saves the state information about Globus upload requests in the database. Added in v6.4. Affects :ref:`:GlobusPollingInterval`. Note that the JVM option :ref:`dataverse.files.globus-monitoring-server` described above must also be enabled on one (and only one, in a multi-node installation) Dataverse instance.
      - ``Off``
+   * - index-harvested-metadata-source
+     - Index the nickname or the source name (See the optional ``sourceName`` field in :ref:`create-a-harvesting-client`) of the harvesting client as the "metadata source" of harvested datasets and files. If enabled, the Metadata Source facet will show separate groupings of the content harvested from different sources (by harvesting client nickname or source name) instead of the default behavior where there is one "Harvested" grouping for all harvested content.
+     - ``Off``
 
 **Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
 ``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
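The flag-to-environment-variable convention in the note above (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``) can be sketched as a tiny helper; `feature_flag_env_var` is hypothetical, for illustration only:

```python
# Hypothetical helper illustrating the documented naming convention:
# upper-case the flag name, replace dashes with underscores, and
# prefix with DATAVERSE_FEATURE_.
def feature_flag_env_var(flag: str) -> str:
    return "DATAVERSE_FEATURE_" + flag.upper().replace("-", "_")

print(feature_flag_env_var("index-harvested-metadata-source"))
# DATAVERSE_FEATURE_INDEX_HARVESTED_METADATA_SOURCE
```

This is the same variable name that the docker-compose change below sets to "1".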

docker-compose-dev.yml

+1
@@ -17,6 +17,7 @@ services:
       SKIP_DEPLOY: "${SKIP_DEPLOY}"
       DATAVERSE_JSF_REFRESH_PERIOD: "1"
       DATAVERSE_FEATURE_API_BEARER_AUTH: "1"
+      DATAVERSE_FEATURE_INDEX_HARVESTED_METADATA_SOURCE: "1"
       DATAVERSE_FEATURE_API_BEARER_AUTH_PROVIDE_MISSING_CLAIMS: "1"
       DATAVERSE_MAIL_SYSTEM_EMAIL: "dataverse@localhost"
       DATAVERSE_MAIL_MTA_HOST: "smtp"
