
Publishing dataset in an external repository #8001

Open
pkiraly opened this issue Jul 15, 2021 · 6 comments

@pkiraly
Member

pkiraly commented Jul 15, 2021

We have a specific feature request which I think would be worth solving with a general solution.

The original request: if a user creates an Arts and Humanities dataset, s/he should also be able to publish it in an external repository called the DARIAH Repository.

As the slogan goes, "lots of copies keep your stuff safe", I believe it would be a valid and supportable use case to create copies of datasets in external repositories.

Here is a suggestion for the user interface:

(Screenshot: external-repository)

The backend and the workflow would look something like this:

  • there should be an ExternalRepository interface, which declares some basic methods to manage this publication process, such as
    • getName(): returns the name of the repository
    • getUrl(): returns the URL of the repository's starting page
    • publish(DatasetVersion datasetVersion): the main method, which publishes the dataset in the repository
    • isActive(): returns whether the repository is turned on in the current Dataverse instance (by default all are turned off; the site admin can activate them via configuration)
  • a number of implementations of this interface, each responsible for an individual external repository. Each might have different configuration settings for authentication, API keys etc. These implementations are singletons.
  • a mapping of subjects to these ExternalRepository implementations. It might be an enumeration or a database table.
  • the authenticated user will see the above screen, where the list of external repositories is generated based on the subjects and the status of the external repository objects.
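To make the idea concrete, here is a minimal sketch of what the interface and one singleton implementation could look like. This is an assumption about the design, not existing Dataverse code; the class name DariahRepository, the URL, and the Object stand-in for DatasetVersion are all illustrative.

```java
// Sketch of the proposed ExternalRepository interface (hypothetical; not part
// of the current Dataverse codebase).
interface ExternalRepository {
    String getName();                 // name of the repository
    String getUrl();                  // URL of the repository's starting page
    boolean isActive();               // whether the site admin enabled it
    void publish(Object datasetVersion); // would take DatasetVersion in Dataverse
}

// One concrete implementation per external repository, as a singleton.
final class DariahRepository implements ExternalRepository {
    private static final DariahRepository INSTANCE = new DariahRepository();

    private DariahRepository() {}

    public static DariahRepository getInstance() {
        return INSTANCE;
    }

    @Override public String getName() { return "DARIAH Repository"; }
    @Override public String getUrl() { return "https://example.org/dariah"; } // illustrative URL
    @Override public boolean isActive() { return true; } // would come from instance configuration
    @Override public void publish(Object datasetVersion) {
        // a real implementation would call the repository's deposit API here,
        // using its own authentication settings and API key
    }
}
```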

Here are some code snippets to give more detail:

Mapping of subjects to repositories:

import java.util.*;

public enum Subject {
  SOCIAL_SCIENCES("Social Sciences", GesisRepository.getInstance()), // a social science repo
  MEDICINE("Medicine, Health and Life Sciences"),
  EARTH("Earth and Environmental Sciences"),
  AGRICULTURE("Agricultural Sciences"),
  OTHER("Other"),
  COMPUTER("Computer and Information Science"),
  HUMANITIES("Arts and Humanities", DariahRepository.getInstance()), // a Digital Humanities repo
  ASTRONOMY("Astronomy and Astrophysics"),
  BUSINESS("Business and Management"),
  LAW("Law"),
  ENGINEERING("Engineering"),
  MATHEMATICS("Mathematical Sciences"),
  CHEMISTRY("Chemistry"),
  PHYSICS("Physics")
  ;

  private final String name;
  // initialize with general repositories, which could be available for all subjects;
  // wrapped in an ArrayList because List.of() is immutable and the varargs
  // constructor adds subject-specific repositories to it
  private final List<ExternalRepository> repositories =
      new ArrayList<>(List.of(HarvardDataverse.getInstance(), DataverseNo.getInstance()));

  Subject(String name) {
    this.name = name;
  }

  Subject(String name, ExternalRepository... repositories) {
    this(name);
    this.repositories.addAll(Arrays.asList(repositories));
  }

  public static Subject byName(String name) {
    for (Subject subject : values())
      if (subject.name.equals(name))
        return subject;
    return null;
  }

  public String getName() {
    return name;
  }

  public List<ExternalRepository> getRepositories() {
    return repositories;
  }
}

Get the list of active repositories:

public List<ExternalRepository> getActiveExternalRepositories() {
    List<ExternalRepository> repositories = new ArrayList<>();
    for (String name : getDatasetSubjects()) {
        Subject subject = Subject.byName(name);
        if (subject != null && subject.getRepositories() != null)
            for (ExternalRepository repository : subject.getRepositories())
                if (repository.isActive())
                    repositories.add(repository);
    }
    return repositories;
}
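The enum mapping above could alternatively be kept outside the code entirely, closer to the "database table" option mentioned earlier. A minimal self-contained sketch of that variant, using plain maps with illustrative repository names (GESIS, DARIAH Repository, Harvard Dataverse, DataverseNO stand in for real configuration entries):

```java
import java.util.*;

// Alternative to the Subject enum: a data-driven mapping from subject name to
// repository names. In Dataverse this could be backed by a database table or
// a JSON setting instead of the hard-coded maps below (all names illustrative).
class SubjectRepositoryMap {
    // general repositories, available for all subjects
    private static final List<String> GENERAL =
        List.of("Harvard Dataverse", "DataverseNO");

    // subject-specific repositories
    private static final Map<String, List<String>> SUBJECT_SPECIFIC = Map.of(
        "Social Sciences", List.of("GESIS"),
        "Arts and Humanities", List.of("DARIAH Repository"));

    public static List<String> repositoriesFor(String subject) {
        List<String> result = new ArrayList<>(GENERAL);
        result.addAll(SUBJECT_SPECIFIC.getOrDefault(subject, List.of()));
        return result;
    }
}
```

The trade-off versus the enum is that new subject/repository pairs could be added by a site admin without a code change, at the cost of losing compile-time checking.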

@pdurbin @qqmyers @poikilotherm @djbrooke @4tikhonov I am interested in your opinion. I have some initial code to prove the concept for myself, but for a PR it needs a lot of work. I would invest this time only if the idea meets the community's approval. Otherwise I will create an independent webservice specific to the DARIAH repository.

@4tikhonov
Contributor

4tikhonov commented Jul 15, 2021

Hi @pkiraly, it's a well-known use case. We already developed such an (external) webservice in 2017 to archive datasets in our Trusted Digital Repository (DANS EASY). However, our workflow is a bit different: first we publish the dataset in Dataverse, then use its metadata and files to create a BagIt package, and archive it afterwards. Please take a look at the slides here: https://www.slideshare.net/vty/cessda-persistent-identifiers

Regarding your possible implementation, I'm pretty sure that developing webservices is the way to go. At the moment Dataverse looks too monolithic, and we have to prepare it for the future using modern technologies and concepts.

@djbrooke
Contributor

(I typed this response this morning and I got sidetracked, apologies :))

I think we'd want to utilize the workflows system (https://guides.dataverse.org/en/latest/developers/workflows.html) to trigger an event to publish into the other system, and I don't think we'd want to add a flow in the Dataverse UI for this. I'd be concerned about communicating failure cases and scalability.

@poikilotherm
Contributor

This might be a good chance to revive discussing #7050. You already could extend Dataverse with a workflow, but this is not tied to the UI IIRC. A way to inject UI components for workflows from plugins would be great IMHO. Less forks, more extensibility.

@pkiraly
Member Author

pkiraly commented Jul 16, 2021

Dear @djbrooke, @4tikhonov and @poikilotherm,

thanks a lot for your feedback and suggestions! I totally agree that Dataverse should not be extended directly but should work with plugins wherever possible.

I checked the suggested workflow documentation and the example scripts in the scripts/api/data/workflows directory, and my feeling is that it solves only one part of the feature request, i.e. the communication with external services. However, an important part of our requirement is that (1) the user should decide (2) on an ad hoc basis whether or not s/he would like to publish the dataset on an external service. I do not see a possibility to set a condition parameter in the workflow which governs whether the step should be executed or not.

To use the workflow for this requirement, the following improvements should be made:

  • there should be a setting on the user's page where s/he could set under which conditions the dataset should be sent to an external archive
  • the workflow should have an extra parameter, for which the administrator either sets conditions directly or asks the system to follow the conditions of the user who publishes the dataset

Example for such a conditional step configuration:

example 1: direct entry of conditions, i.e. archive the dataset only if the subject is "Arts and Humanities", the user is affiliated with a Humanities organisation, and it is a new major version:

{
  "provider":":internal",
  "stepType":"http/sr",
  "parameters": {
    ...
    "conditions": [
      "${dataset.subject}=[Arts and Humanities]",
      "${user.affiliation}=[DARIAH, Department of Humanities]",
      "${minorVersion}=0"
    ]
  }
}

example 2: the workflow should retrieve and evaluate the user's conditions, which have been set on the user's page or via API

{
  "provider":":internal",
  "stepType":"http/sr",
  "parameters": {
    ...
    "conditions": ["${user.externalArchivingConditions}"]
  }
}
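No such "conditions" parameter exists in the Dataverse workflow engine today; the syntax above is my proposal. To show it is implementable, here is a hypothetical sketch of how a workflow step could evaluate such entries, where the variable names and the `${var}=[values]` grammar are assumptions from the examples above:

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical evaluator for proposed workflow "conditions" entries like
//   "${dataset.subject}=[Arts and Humanities]"
// This is not part of Dataverse; it only sketches how the proposed syntax
// could be parsed and checked against a variable context.
class ConditionEvaluator {
    // matches ${variable}=value or ${variable}=[value1, value2, ...]
    private static final Pattern CONDITION =
        Pattern.compile("\\$\\{([^}]+)\\}=\\[?([^\\]]*)\\]?");

    // context maps variable names (e.g. "dataset.subject") to their values;
    // all conditions must hold for the step to run
    public static boolean evaluate(List<String> conditions, Map<String, String> context) {
        for (String condition : conditions) {
            Matcher m = CONDITION.matcher(condition);
            if (!m.matches()) return false;       // unparseable condition: fail closed
            String value = context.get(m.group(1));
            // bracketed values list alternatives, any one of which satisfies the condition
            List<String> allowed = Arrays.asList(m.group(2).split(",\\s*"));
            if (value == null || !allowed.contains(value)) return false;
        }
        return true;
    }
}
```

With example 1 above, a dataset whose context contains dataset.subject = "Arts and Humanities", user.affiliation = "DARIAH", and minorVersion = "0" would pass, and any other dataset would skip the archiving step.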

A question: are you aware of any existing open source plugin for Dataverse I can check?

@pdurbin pdurbin changed the title Publishing dataset in an external reporitory Publishing dataset in an external repository Apr 21, 2022
@pdurbin
Member

pdurbin commented Oct 14, 2022

@pkiraly maybe there's a better video or screenshots @qqmyers can point us to, but there's now some UI for curators to see the status of publishing/archiving to another repository. The screenshot below is from "Final Demo - Full Final demo of automatic ingests of Dataverse exports into DRS, including successful, failed, and message error scenarios" at https://github.com/harvard-lts/awesome-lts#2022-06-29-final-demo via this pull request that was merged into 5.12 (just released):

(Screenshot: Screen Shot 2022-10-14 at 7 35 33 AM)

It seems highly related at least! I think it might use a command instead of a workflow though. (No, I can't think of any plugins you can check.)

@qqmyers
Copy link
Member

qqmyers commented Oct 14, 2022

FWIW: Automation is via workflow (i.e. configured to post-publish), but the workflow step calls an archiving command. Those are dynamically loaded so dropping a new one in the exploded war should work. (We haven't dealt with a separate class loader yet.)
