Merge pull request #3330 from IQSS/develop
Merge 4.5 into Master
djbrooke authored Sep 2, 2016
2 parents 93597b4 + dc58ae1 commit 067201c
Showing 220 changed files with 31,650 additions and 1,306 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -22,3 +22,4 @@ scripts/api/py_api_wrapper/local-data/*
doc/sphinx-guides/build
faces-config.NavData
src/main/java/BuildNumber.properties
/nbproject/
5 changes: 5 additions & 0 deletions Vagrantfile
@@ -15,6 +15,11 @@ Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
puts "OPERATING_SYSTEM environment variable not specified. Using #{operating_system} by default.\nTo specify it in bash: export OPERATING_SYSTEM=debian"
config.vm.box_url = "http://puppet-vagrant-boxes.puppetlabs.com/centos-65-x64-virtualbox-puppet.box"
config.vm.box = "puppet-vagrant-boxes.puppetlabs.com-centos-65-x64-virtualbox-puppet.box"
elsif ENV['OPERATING_SYSTEM'] == 'centos7'
puts "WARNING: CentOS 7 specified. Newer than what the dev team tests on."
config.vm.box_url = "https://atlas.hashicorp.com/puppetlabs/boxes/centos-7.2-64-puppet/versions/1.0.1/providers/virtualbox.box"
config.vm.box = "puppetlabs-centos-7.2-64-puppet-1.0.1-virtualbox.box"
standalone.vm.box = "puppetlabs-centos-7.2-64-puppet-1.0.1-virtualbox.box"
elsif ENV['OPERATING_SYSTEM'] == 'debian'
puts "WARNING: Debian specified. Here be dragons! https://github.com/IQSS/dataverse/issues/1059"
config.vm.box_url = "http://puppet-vagrant-boxes.puppetlabs.com/debian-73-x64-virtualbox-puppet.box"
2 changes: 2 additions & 0 deletions conf/solr/4.6.0/schema.xml
@@ -249,6 +249,8 @@
<field name="discoverableBy" type="string" stored="true" indexed="true" multiValued="true"/>

<field name="dvObjectType" type="string" stored="true" indexed="true" multiValued="false"/>
<field name="metadataSource" type="string" stored="true" indexed="true" multiValued="false"/>
<field name="isHarvested" type="boolean" stored="true" indexed="true" multiValued="false"/>

<field name="dvName" type="text_en" stored="true" indexed="true" multiValued="false"/>
<field name="dvAffiliation" type="text_en" stored="true" indexed="true" multiValued="false"/>
File renamed without changes.
@@ -9,6 +9,7 @@ https://wiki.shibboleth.net/confluence/display/SHIB2/NativeSPConfiguration
-->

<SPConfig xmlns="urn:mace:shibboleth:2.0:native:sp:config" xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
clockSkew="1800">

<!-- FIXME: change the entityID to your hostname. -->
@@ -54,6 +55,23 @@ https://wiki.shibboleth.net/confluence/display/SHIB2/NativeSPConfiguration
<!-- Loads and trusts a metadata file that describes only the Testshib IdP and how to communicate with it. -->
<!-- IdPs we want allow go in /etc/shibboleth/dataverse-idp-metadata.xml -->
<MetadataProvider type="XML" file="dataverse-idp-metadata.xml" backingFilePath="local-idp-metadata.xml" legacyOrgNames="true" reloadInterval="7200"/>
<!-- Uncomment to enable all the Research & Scholarship IdPs from InCommon -->
<!--
<MetadataProvider type="XML" url="http://md.incommon.org/InCommon/InCommon-metadata.xml" backingFilePath="InCommon-metadata.xml" maxRefreshDelay="3600">
<DiscoveryFilter type="Whitelist" matcher="EntityAttributes">
<saml:Attribute
Name="http://macedir.org/entity-category-support"
NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri">
<saml:AttributeValue>http://id.incommon.org/category/research-and-scholarship</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute
Name="http://macedir.org/entity-category-support"
NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri">
<saml:AttributeValue>http://refeds.org/category/research-and-scholarship</saml:AttributeValue>
</saml:Attribute>
</DiscoveryFilter>
</MetadataProvider>
-->

<!-- Attribute and trust options you shouldn't need to change. -->
<AttributeExtractor type="XML" validate="true" path="attribute-map.xml"/>
Binary file not shown.
37 changes: 37 additions & 0 deletions doc/sphinx-guides/source/admin/harvestclients.rst
@@ -0,0 +1,37 @@
Managing Harvesting Clients
===========================

.. contents:: :local:

Your Dataverse as a Metadata Harvester
--------------------------------------

Harvesting is the process of exchanging metadata with other repositories. As a harvesting *client*, your Dataverse can
gather metadata records from remote sources. These can be other Dataverse instances or any other archives that support OAI-PMH, the standard harvesting protocol. Harvested metadata records are indexed and made searchable by your users. Clicking on a harvested dataset in the search results takes the user to the original repository. Harvested datasets cannot be edited in your Dataverse installation.

Harvested records can be kept in sync with the original repository through scheduled incremental updates, daily or weekly.
Alternatively, harvests can be run on demand by the Admin.

Managing Harvesting Clients
---------------------------

To start harvesting metadata from a remote OAI repository, you first create and configure a *Harvesting Client*.

Clients are managed on the "Harvesting Clients" page accessible via the Dashboard. Click on the *Add Client* button to get started.

The process of creating a new client, or editing an existing one, is largely self-explanatory. It is split into logical steps, in a way that allows the user to go back and correct entries made earlier. The process is interactive, and guidance text is provided. For example, the user is required to enter the URL of the remote OAI server. When they click *Next*, the application will try to establish a connection to the server in order to verify that it is working, and to obtain information about the sets of metadata records and the metadata formats it supports. The choices offered to the user on the next page are based on this extra information. If the application fails to establish a connection to the remote archive at the address specified, or if an invalid response is received, the user is given an opportunity to check and correct the URL they entered.
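
Under the hood, this connection test boils down to a few standard OAI-PMH requests. The sketch below (in Python) illustrates the same kind of check; it is not part of Dataverse, and the server URL is a placeholder that you would replace with the remote archive's OAI endpoint.

.. code-block:: python

    # Minimal sketch of the kind of check the "Add Client" form performs:
    # confirm the remote endpoint speaks OAI-PMH and list the metadata
    # formats and sets it advertises. The URL below is a placeholder.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def oai_request(base_url, verb, **params):
        """Issue a single OAI-PMH request and return the parsed XML response."""
        query = urllib.parse.urlencode({"verb": verb, **params})
        with urllib.request.urlopen(f"{base_url}?{query}", timeout=30) as resp:
            return ET.fromstring(resp.read())

    base_url = "https://repository.example.edu/oai"  # placeholder remote OAI server

    # "Identify" confirms that the endpoint responds and speaks OAI-PMH.
    identify = oai_request(base_url, "Identify")
    print("Repository:", identify.findtext(f"{OAI_NS}Identify/{OAI_NS}repositoryName"))

    # The client form also needs the metadata formats and sets the server offers.
    for fmt in oai_request(base_url, "ListMetadataFormats").iter(f"{OAI_NS}metadataPrefix"):
        print("Format:", fmt.text)

    for oai_set in oai_request(base_url, "ListSets").iter(f"{OAI_NS}setSpec"):
        print("Set:", oai_set.text)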

New in Dataverse 4, vs. DVN 3
-----------------------------


- Note that when creating a client you will need to select an existing local dataverse to host the harvested datasets. In DVN 3, a dedicated "harvesting dataverse" was created specifically for each remote harvesting source. In Dataverse 4, harvested content can be added to *any dataverse*. This means that a dataverse can now contain datasets harvested from multiple sources and/or a mix of local and harvested datasets.


- An extra "Archive Type" pull down menu is added to the Create and Edit dialogs. This setting, selected from the choices such as "Dataverse 4", "DVN, v2-3", "Generic OAI", etc. is used to properly format the harvested metadata as they are shown in the search results. It is **very important** to select the type that best describes this remote server, as failure to do so can result in information missing from the search results, and, a **failure to redirect the user to the archival source** of the data!

It is, however, **very easy to correct** a mistake like this. For example, let's say you have created a client to harvest from the XYZ Institute and specified the archive type as "Dataverse 4". You have been able to harvest content, and the datasets appear in the search results, but clicking on them results in a "Page Not Found" error on the remote site. At that point you realize that the XYZ Institute admins have not yet upgraded to Dataverse 4 and are still running DVN v3.1.2. All you need to do is go back to the Harvesting Clients page and change the setting to "DVN, v2-3". This will fix the redirects **without having to re-harvest** the datasets.

- Another extra entry, "Archive Description", is added to the *Edit Harvesting Client* dialog. This description appears at the bottom of each search result card for a harvested dataset or datafile. By default, this text reads "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data." Here it can be customized to be more descriptive, for example, "This Dataset is harvested from our partners at the XYZ Institute..."


130 changes: 130 additions & 0 deletions doc/sphinx-guides/source/admin/harvestserver.rst
@@ -0,0 +1,130 @@
Managing Harvesting Server and Sets
===================================

.. contents:: :local:

Your Dataverse as an OAI server
-------------------------------

As a harvesting *server*, your Dataverse can make some of the local
dataset metadata available to remote harvesting clients. These can be
other Dataverse instances, or any other clients that support the
OAI-PMH harvesting protocol. Note that the terms "Harvesting Server"
and "OAI Server" are used interchangeably throughout this guide and in
the inline help text.

How does it work?
-----------------

Only published, unrestricted datasets in your Dataverse can be made
harvestable. Remote clients normally keep their records in sync
through scheduled incremental updates, daily or weekly, thus
minimizing the load on your server. Note that only the metadata are
harvested; remote harvesters will generally not attempt to download
the data files associated with the harvested datasets.

The harvesting server can be enabled or disabled on the "Harvesting
Server" page accessible via the Dashboard. It is disabled by default
on a brand new, "out of the box" Dataverse.

OAI Sets
--------

Once the service is enabled, you define collections of local datasets
that will be available to remote harvesters as *OAI Sets*. Once again,
the terms "OAI Set" and "Harvesting Set" are used
interchangeably. Sets are defined by search queries. Any such query
that finds any number of published, local (non-harvested) datasets can
be used to create an OAI set. Sets can overlap local dataverses, and
can include as few or as many of your local datasets as you wish. A
good way to master the Dataverse search query language is to
experiment with the Advanced Search page. We also recommend that you
consult the Search API section of the Dataverse User Guide.

Once you have entered the search query and clicked *Next*, the number
of search results found will be shown on the next screen. This way, if
you are seeing a number that's different from what you expected, you
can go back and try to re-define the query.

Some useful examples of search queries to define OAI sets:

- A good way to create a set that includes all of your local, published datasets is to query by the Unique Identifier authority registered to your Dataverse, for example:

``dsPersistentId:"doi:1234/"``

Note that double quotes must be used, since the search field value contains the colon symbol!

Note also that the search terms limiting the results to published and local datasets **are added to the query automatically**, so you don't need to worry about that.

- A query to create a set to include the datasets from a specific local dataverse:

``parentId:NNN``

where NNN is the database id of the dataverse object (consult the Dataverse table of the SQL database used by the application to verify the database id).

- A query to find all the datasets by a certain author:

``authorName:YYY``

where YYY is the name.

- Complex queries can be created with multiple logical AND and OR operators. For example,

``(authorName:YYY OR authorName:ZZZ) AND dsPublicationDate:NNNN``

- Some further query examples:

For specific datasets using a persistentID:

``(dsPersistentId:10.5000/ZZYYXX/ OR dsPersistentId:10.5000/XXYYZZ)``

For all datasets within a specific ID authority:

``dsPersistentId:10.5000/XXYYZZ``

For all dataverses with subjects of Astronomy and Astrophysics or Earth and Environmental Sciences:

``(dvSubject:"Astronomy and Astrophysics" OR dvSubject:"Earth and Environmental Sciences")``

For all datasets containing the keyword "censorship":

``keywordValue:censorship``
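
If you prefer to preview a candidate query outside the UI, the Search API mentioned above can give a rough count of matching datasets before you turn the query into a set. The sketch below is illustrative only; it assumes the standard Search API at ``/api/search``, and the count may differ slightly from the set editor, which additionally filters out harvested datasets for you.

.. code-block:: python

    # Rough preview of how many datasets a candidate set-defining query matches,
    # using the regular Search API. Treat this as an approximation: the OAI set
    # editor also excludes harvested datasets automatically.
    import json
    import urllib.parse
    import urllib.request

    base_url = "http://localhost:8080"   # adjust to your server
    query = 'authorName:king'            # the candidate set-defining query

    params = urllib.parse.urlencode({"q": query, "type": "dataset", "per_page": 1})
    with urllib.request.urlopen(f"{base_url}/api/search?{params}", timeout=30) as resp:
        result = json.load(resp)

    print("Matching datasets:", result["data"]["total_count"])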

Important: New SOLR schema required!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to be able to define OAI sets, your SOLR server must be upgraded with the search schema that came with Dataverse release 4.5 (or later), and all your local datasets must be re-indexed once the new schema is installed.
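
A full re-index can be triggered through the admin API once the new schema is in place. The one-liner below is a sketch only; it assumes the unrestricted admin API is reachable on localhost and that your release provides the ``/api/admin/index`` endpoint - consult the Installation Guide of your release for the recommended re-indexing procedure.

.. code-block:: python

    # Kick off re-indexing after installing the new schema. Assumes the admin
    # API is open on localhost and that /api/admin/index is available in this
    # release; the response is a status message from the server.
    import urllib.request

    with urllib.request.urlopen("http://localhost:8080/api/admin/index", timeout=300) as resp:
        print(resp.read().decode())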

OAI Set updates
---------------

Every time a new harvesting set is created, or changes are made to an
existing set, the contents of the set are automatically updated: the
Dataverse application will find the datasets defined by the query and
attempt to run the metadata export on the ones that haven't been
exported yet. Only the datasets for which the export has completed
successfully, and for which the results have been cached on the
filesystem, are included in the OAI sets advertised to the harvesting
clients!
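
A quick way to see exactly which records a set is advertising at any given moment is to query your own OAI server directly. The sketch below assumes the OAI endpoint is exposed at ``/oai`` (its usual location) and uses a hypothetical set name; adjust both to match your installation.

.. code-block:: python

    # List the record identifiers currently advertised by one OAI set.
    # The endpoint path and the set name are assumptions for illustration.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    base_url = "http://localhost:8080/oai"   # adjust to your server

    params = urllib.parse.urlencode({
        "verb": "ListIdentifiers",
        "metadataPrefix": "oai_dc",
        "set": "king_datasets",              # hypothetical set name
    })
    with urllib.request.urlopen(f"{base_url}?{params}", timeout=30) as resp:
        root = ET.fromstring(resp.read())

    for header in root.iter(f"{OAI_NS}header"):
        print(header.findtext(f"{OAI_NS}identifier"))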

This is in contrast to how sets used to be managed in DVN v3, where
sets had to be exported manually before any such changes took effect.

**Important:** Note, however, that changes made to the actual dataset
metadata do not immediately trigger an update of the corresponding OAI
sets! For example, let's say you have created an OAI set defined by
the search query ``authorName:king``, which resulted in 43 dataset
records. If a new dataset by the same author is added and published,
this **does not** immediately add the extra record to the set! It
would simply be too expensive to refresh all the sets every time any
change to the metadata is made.

The OAI set will, however, be updated automatically by a scheduled
metadata export job that runs every night (at 2AM, by default). This
export timer is created and activated automatically every time the
application is deployed or restarted. Once again, this is new in
Dataverse 4; in DVN v3, export jobs had to be scheduled and activated
by the admin user. See the "Export" section of the Admin Guide for
more information on the automated metadata exports.

It is still possible, however, to have changes like this reflected in
the OAI server immediately, by going to the *Harvesting Server* page
and clicking the "Run Export" icon next to the desired OAI set.
21 changes: 21 additions & 0 deletions doc/sphinx-guides/source/admin/index.rst
@@ -0,0 +1,21 @@
.. Dataverse API Documentation master file, created by
sphinx-quickstart on Wed Aug 28 17:54:16 2013.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Admin Guide
===========

This guide documents functionality available only to Dataverse Admin ("Network Administrator") users. As of this release (4.5), such functionality includes managing harvesting (client and server) and batch metadata export.

These "superuser" tasks are managed via the new page called the Dashboard. A user logged in as a Dataverse Admin will see the Dashboard link rendered in the upper right corner of every Dataverse page.

Contents:

.. toctree::
:maxdepth: 2

harvestclients
harvestserver
metadataexport
timers
30 changes: 30 additions & 0 deletions doc/sphinx-guides/source/admin/metadataexport.rst
@@ -0,0 +1,30 @@
Metadata Export
===============

.. contents:: :local:

Automatic Exports
-----------------

Unlike in DVN v3, publishing a dataset in Dataverse 4 automatically starts a metadata export job that runs in the background, asynchronously. Once it completes, the dataset metadata will have been exported and cached in all the supported formats (Dublin Core, Data Documentation Initiative (DDI), and native JSON). There is no need to run the export manually.

A scheduled timer job that runs nightly will attempt to export any published datasets that, for whatever reason, haven't been exported yet. This timer is activated automatically on deployment or restart of the application, so, again, there is no need to start or configure it manually. (See the "Application Timers" section of this guide for more information.)

Batch exports through the API
-----------------------------

In addition to the automated exports, a Dataverse admin can start a batch job through the API. The following two API calls are provided:

``/api/admin/metadata/exportAll``

``/api/admin/metadata/reExportAll``

The former will attempt to export all the published, local (non-harvested) datasets that haven't been exported yet.
The latter will *force* a re-export of every published, local dataset, regardless of whether it has already been exported or not.

Note that creating, modifying, or re-exporting an OAI set will also attempt to export all the unexported datasets found in the set.
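
The calls can be issued with any HTTP client. The sketch below shows one way to do it; it assumes the admin API is reachable on localhost without an API token, i.e. the typical configuration where ``/api/admin`` is blocked from the outside but open locally.

.. code-block:: python

    # Start a batch export of all published, local datasets that haven't been
    # exported yet. Assumes the admin API is open on localhost.
    import urllib.request

    base_url = "http://localhost:8080"

    with urllib.request.urlopen(f"{base_url}/api/admin/metadata/exportAll", timeout=60) as resp:
        print(resp.read().decode())

    # To force a re-export of every published, local dataset instead:
    # urllib.request.urlopen(f"{base_url}/api/admin/metadata/reExportAll", timeout=60)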

Export Failures
---------------

An export batch job, whether started via the API or by the application timer, will leave a detailed log in your configured logs directory. This is the same location where the main Glassfish server.log is found. The name of the log file is ``export_[timestamp].log`` - for example, *export_2016-08-23T03-35-23.log*. The log will contain the number of datasets processed successfully, the number for which the metadata export failed, and some information on the failures detected. Please attach this log file if you need to contact Dataverse support about metadata export problems.
43 changes: 43 additions & 0 deletions doc/sphinx-guides/source/admin/timers.rst
@@ -0,0 +1,43 @@
Dataverse Application Timers
============================

.. contents:: :local:

Dataverse uses timers to automatically run scheduled Harvest and Metadata export jobs.


Dedicated timer server in a Dataverse server cluster
----------------------------------------------------

When running a Dataverse cluster - i.e. multiple Dataverse application
servers talking to the same database - **only one** of them must act
as the *dedicated timer server*. This is to avoid starting conflicting
batch jobs on multiple nodes at the same time.

This does not affect a single-server installation. So you can safely skip this section unless you are running a multi-server cluster.

The following JVM option instructs the application to act as the dedicated timer server:

``-Ddataverse.timerServer=true``

**IMPORTANT:** Note that this option is set automatically by the Dataverse installer script. This means that, when configuring a multi-server cluster, it is the responsibility of the person performing the installation to remove the option from the domain.xml of every node except the one intended to be the timer server.

Harvesting Timers
-----------------

These timers are created when scheduled harvesting is enabled by a local admin user (via the "Manage Harvesting Clients" page).

In a multi-node cluster, all these timers will be created on the dedicated timer node (and not necessarily on the node where the harvesting client was created and/or saved).

A timer will be removed automatically when a harvesting client with an active schedule is deleted, or when the schedule is turned off for an existing client.

Metadata Export Timer
---------------------

This timer is created automatically whenever the application is deployed or restarted. There is no admin user-accessible configuration for this timer.

This timer runs a daily job that tries to export all the local, published datasets that haven't been exported yet, in all the supported metadata formats, and to cache the results on the filesystem. (Note that, normally, an export happens automatically whenever a dataset is published, so this scheduled job is there to catch any datasets for which that export did not succeed for one reason or another.) Also, since this functionality was added in version 4.5, none of your datasets will have been exported yet if you are upgrading from a previous version, so the first time this job runs it will attempt to export them all.

This daily job will also update all the harvestable OAI sets configured on your server, adding new and/or newly published datasets or marking deaccessioned datasets as "deleted" in the corresponding sets as needed.

This job is automatically scheduled to run at 2AM local time every night. If really necessary, it is possible (for an advanced user) to change that time by directly editing the EJB timer application table in the database.