DD-1755 #11

Merged 19 commits on Jan 21, 2025

Changes from all commits
122 changes: 98 additions & 24 deletions docs/description.md
@@ -53,24 +53,24 @@ The init file initializes the ingest process. It can be used to verify that an e
```yaml
init:
  expect:
-    state: 'released' # or 'draft', 'absent'.
+    state: 'released' # or 'draft'
    dataverseRoleAssignment:
      assignee: '@myuser'
      role: ':autenticated-user'
    datasetRoleAssignment:
      assignee: '@myuser'
      role: 'contributor'
```

If the state of the dataset does not match the expected state, the ingest procedure will be aborted and the deposit will be put in the FAILED state. The
-expected state can be either `released`, `draft` or `absent` (meaning that the dataset should not exist). By default, no check will be performed.
+expected state can be either `released` or `draft`. By default, no check will be performed.

If the role assignment in `dataverseRoleAssignment` does not exist on the target dataverse collection (currently always `root`), the ingest procedure will be
aborted, putting the deposit in a REJECTED state. The same happens if the role assignment in `datasetRoleAssignment` does not exist on the target dataset
(only applicable to update-deposits).

Note the difference in the resulting deposit state when expectations are not met. The rationale is that an unexpected dataset state is likely an error by the
service or Dataverse, whereas the expected role assignments are a kind of authorization check.

The `init.yml` file can also be used to instruct the service to import the bag as a dataset with an existing DOI:
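The concrete example sits outside the expanded part of this diff. As an illustration only, such an init file might look roughly like the sketch below; the `create.importPid` field name is an assumption, not confirmed by this diff, so check the full description.md for the exact syntax:

```yaml
init:
  create:
    importPid: 'doi:10.5072/EXAMPLE1'  # assumed field name; the DOI is a placeholder
```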
@@ -136,27 +136,28 @@ editFiles:
# directory under the same path as the original file has in the dataset. Note that files in `replaceFiles` will automatically be skipped in the add files step,
replaceFiles:
- 'file2.txt'
+# Adds files to the dataset and makes them unrestricted. The files are processed in batches, meaning they are uploaded as ZIP files to Dataverse, for Dataverse
+# to unzip them. This is more efficient than adding the files one by one.
+addUnrestrictedFiles:
+- 'file6.txt'
# Adds files to the dataset and makes them restricted. The files are processed in batches, meaning they are uploaded as ZIP files to Dataverse, for Dataverse
# to unzip them. This is more efficient than adding the files one by one.
addRestrictedFiles:
- 'file4.txt'
- 'subdirectory/file5.txt'
-# The same as above, but the files are added unrestricted.
-addUnrestrictedFiles:
-- 'file6.txt'
-# Adds files to the dataset and makes them restricted. The files are processed one by one, meaning they are uploaded as individual files to Dataverse. This is
+# Adds files to the dataset and makes them unrestricted. The files are processed one by one, meaning they are uploaded as individual files to Dataverse. This is
# useful if you need to circumvent special processing by Dataverse, such as re-zipping Shapefile projects.
# See: https://guides.dataverse.org/en/6.3/developers/geospatial.html#geospatial-data
#
# Conversely, you could use this to make sure a ZIP file in your bag is expanded by Dataverse, since Dataverse will not expand ZIP files that are uploaded
# inside another ZIP file, as is the case with the batch upload method.
-addRestrictedFilesIndividually:
-- 'bicycles.shp'
-- 'bicycles.shx'
-- 'bicycles.dbf'
-- 'bicycles.prj'
-# The same as above, but the files are added unrestricted.
addUnrestrictedFilesIndividually:
- 'bicycles.shp'
- 'bicycles.shx'
- 'bicycles.dbf'
- 'bicycles.prj'
+# The same as above, but the files are added restricted.
+addRestrictedFilesIndividually:
+- 'bicycles.shp'
+- 'bicycles.shx'
+- 'bicycles.dbf'
@@ -173,15 +174,15 @@ editFiles:
  directoryLabel: "subdirectory"
  restricted: false
  categories: [ 'Testlabel' ]
-# This is not a separate step, but the auto-renaming takes place whenever a local filepath is translated to a dataset filepath.
-autoRenameFiles:
-- from: "Unsanitize'd/file?" # Local file name
-  to: "Sanitize_d/file_" # The file name assigned in the dataset
# Sets one or more embargoes on the files in the dataset.
addEmbargoes:
- filePaths: [ 'file1.txt' ] # All other files will NOT be embargoed
  dateAvailable: '2030-01-01'
  reason: 'Pending publication'
+# This is not a separate step, but the auto-renaming takes place whenever a local filepath is translated to a dataset filepath.
+autoRenameFiles:
+- from: "Unsanitize'd/file?" # Local file name
+  to: "Sanitize_d/file_" # The file name assigned in the dataset
```
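As an illustration of the ZIP note above: if a bag contains a ZIP file that should be unpacked by Dataverse, listing it under one of the `...Individually` actions uploads it on its own instead of inside a batch ZIP, so Dataverse will expand it. The file name below is hypothetical:

```yaml
editFiles:
  addUnrestrictedFilesIndividually:
    - 'data/archive.zip'  # uploaded as a single file, so Dataverse unzips it on ingest
```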

##### edit-metadata.yml
@@ -333,3 +334,76 @@ deposit.
The actions described in the Yaml files will be executed in the same order as they are listed above. Note that changing the order of the actions in the Yaml files
has no effect on the order in which they are executed. All files and all action fields (e.g., `addRestrictedFiles`) are optional, except for `dataset.yml` when
creating a new dataset.
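For example, because every action field is optional, a deposit that only needs to embargo a single file could ship an `edit-files.yml` containing nothing but the following (file name and date are hypothetical):

```yaml
editFiles:
  addEmbargoes:
    - filePaths: [ 'chapter1.pdf' ]  # only this file is embargoed
      dateAvailable: '2031-06-01'
      reason: 'Pending publication'
```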

### The task log

The service keeps track of the progress of the processing in a file called `_tasks.yml`. Its layout corresponds closely to the combined layout of the instruction Yaml
files (the underscore in the name is there to make it stand out in the directory listing):

```yaml
taskLog:
init:
targetPid: null
expect:
state:
completed: false
dataverseRoleAssignment:
completed: false
datasetRoleAssignment:
completed: false
create:
completed: false
dataset:
completed: false
editFiles:
deleteFiles:
completed: false
numberCompleted: 0
replaceFiles:
completed: false
numberCompleted: 0
addUnrestrictedFiles:
completed: false
numberCompleted: 0
addRestrictedFiles:
completed: false
numberCompleted: 0
addUnrestrictedIndividually:
completed: false
numberCompleted: 0
addRestrictedIndividually:
completed: false
numberCompleted: 0
moveFiles:
completed: false
numberCompleted: 0
updateFileMetas:
completed: false
numberCompleted: 0
addEmbargoes:
completed: false
numberCompleted: 0
editMetadata:
addFieldValues:
completed: false
replaceFieldValues:
completed: false
deleteFieldValues:
completed: false
editPermissions:
deleteRoleAssignments:
completed: false
numberCompleted: 0
addRoleAssignments:
completed: false
numberCompleted: 0
updateState:
completed: false
```

The file is updated in memory and written to the root of the bag when the processing of the bag finishes or fails. If the file is found in the root of the bag at
the start of processing, the service will continue from where it left off. Note that some items have a `numberCompleted` field, so even when the overall task is
not yet completed, the service can resume partway through that task.
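For example (hypothetical values), a run that was interrupted while adding restricted files might leave the following fragment in `_tasks.yml`; on the next run the service skips the files already counted in `numberCompleted` and continues with the rest:

```yaml
# excerpt from _tasks.yml after an interrupted run
editFiles:
  addRestrictedFiles:
    completed: false
    numberCompleted: 42  # the first 42 files were added; the next run continues with file 43
```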

If you want to re-ingest a deposit completely, delete the `_tasks.yml` file from the root of the deposit. You should probably also delete the dataset version
that was created in the previous run.
src/main/java/nl/knaw/dans/dvingest/core/DataverseIngestBag.java
@@ -16,6 +16,7 @@
package nl.knaw.dans.dvingest.core;

import io.dropwizard.configuration.ConfigurationException;
import lombok.Getter;
import nl.knaw.dans.dvingest.core.service.YamlService;
import nl.knaw.dans.dvingest.core.service.YamlServiceImpl;
import nl.knaw.dans.dvingest.core.yaml.EditFiles;
@@ -28,6 +29,8 @@
import nl.knaw.dans.dvingest.core.yaml.InitRoot;
import nl.knaw.dans.dvingest.core.yaml.UpdateAction;
import nl.knaw.dans.dvingest.core.yaml.UpdateStateRoot;
import nl.knaw.dans.dvingest.core.yaml.tasklog.TaskLog;
import nl.knaw.dans.dvingest.core.yaml.tasklog.TaskLogRoot;
import nl.knaw.dans.lib.dataverse.model.dataset.Dataset;

import java.io.IOException;
@@ -44,16 +47,32 @@ public class DataverseIngestBag implements Comparable<DataverseIngestBag> {
public static final String EDIT_METADATA_YML = "edit-metadata.yml";
public static final String EDIT_PERMISSIONS_YML = "edit-permissions.yml";
public static final String UPDATE_STATE_YML = "update-state.yml";
public static final String TASK_LOG_YAML = "_tasks.yml";

private final Path bagDir;
@Getter
private final TaskLog taskLog;

-    public DataverseIngestBag(Path bagDir, YamlService yamlService) {
+    public DataverseIngestBag(Path bagDir, YamlService yamlService) throws IOException {
this.bagDir = bagDir;
this.yamService = (YamlServiceImpl) yamlService;
// Minimal check to see if it is a bag
if (!Files.exists(bagDir.resolve("bagit.txt"))) {
throw new IllegalStateException("Not a bag: " + bagDir);
}
if (!Files.exists(bagDir.resolve(TASK_LOG_YAML))) {
taskLog = new TaskLog();
saveTaskLog();
}
else {
try {
var actionLogRoot = yamlService.readYaml(bagDir.resolve(TASK_LOG_YAML), TaskLogRoot.class);
taskLog = actionLogRoot.getTaskLog();
}
catch (ConfigurationException e) {
throw new IllegalStateException("Error reading action log", e);
}
}
}

public boolean looksLikeDansBag() {
@@ -105,10 +124,14 @@ public UpdateAction getUpdateState() throws IOException, ConfigurationException {
if (!Files.exists(bagDir.resolve(UPDATE_STATE_YML))) {
return null;
}
        var updateStateRoot = yamService.readYaml(bagDir.resolve(UPDATE_STATE_YML), UpdateStateRoot.class);
return updateStateRoot.getUpdateState();
}

public void saveTaskLog() throws IOException {
yamService.writeYaml(new TaskLogRoot(taskLog), bagDir.resolve(TASK_LOG_YAML));
}

@Override
public int compareTo(DataverseIngestBag dataverseIngestBag) {
return bagDir.getFileName().toString().compareTo(dataverseIngestBag.bagDir.getFileName().toString());
src/main/java/nl/knaw/dans/dvingest/core/DataverseIngestDeposit.java
@@ -28,6 +28,7 @@
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
@@ -77,13 +78,14 @@ public boolean convertDansDepositIfNeeded() {

@Override
public List<DataverseIngestBag> getBags() throws IOException {
+        List<DataverseIngestBag> bags = new ArrayList<>();
        try (var files = Files.list(location)) {
-            return files
-                .filter(Files::isDirectory)
-                .map(path -> new DataverseIngestBag(path, yamlService))
-                .sorted()
-                .toList();
+            for (Path path : files.filter(Files::isDirectory).toList()) {
+                bags.add(new DataverseIngestBag(path, yamlService));
+            }
        }
+        bags.sort(null);
+        return bags;
}

public void updateProperties(Map<String, String> properties) {
60 changes: 34 additions & 26 deletions src/main/java/nl/knaw/dans/dvingest/core/ImportJob.java
@@ -34,7 +34,7 @@ public class ImportJob implements Runnable {
private final ImportCommandDto importCommand;
@NonNull
private final Path outputDir;
-    private boolean onlyConvertDansDeposit;
+    private final boolean onlyConvertDansDeposit;
private final DataverseIngestDepositFactory depositFactory;
private final DepositTaskFactory depositTaskFactory;

@@ -62,6 +62,18 @@ public void run() {
try {
log.debug("Starting import job: {}", importCommand);
status.setStatus(StatusEnum.RUNNING);
var deposits = createDataverseIngestDeposits();
initOutputDir();
processDeposits(deposits);
}
catch (Exception e) {
log.error("Failed to process import job", e);
status.setStatus(StatusEnum.FAILED);
status.setMessage(e.getMessage());
}
}

private TreeSet<DataverseIngestDeposit> createDataverseIngestDeposits() throws IOException {
var deposits = new TreeSet<DataverseIngestDeposit>();

if (importCommand.getSingleObject()) {
@@ -75,31 +87,7 @@ public void run() {
.forEach(deposits::add);
}
}

-            initOutputDir();
-
-            for (DataverseIngestDeposit dataverseIngestDeposit : deposits) {
-                if (cancelled) {
-                    log.info("Import job cancelled");
-                    status.setStatus(StatusEnum.DONE);
-                    status.setMessage("Import job cancelled");
-                    return;
-                }
-
-                log.info("START Processing deposit: {}", dataverseIngestDeposit.getId());
-                var task = depositTaskFactory.createDepositTask(dataverseIngestDeposit, outputDir, onlyConvertDansDeposit);
-                task.run();
-                log.info("END Processing deposit: {}", dataverseIngestDeposit.getId());
-                // TODO: record number of processed/rejected/failed deposits in ImportJob status
-            }
-
-            status.setStatus(StatusEnum.DONE);
-        }
-        catch (Exception e) {
-            log.error("Failed to process import job", e);
-            status.setStatus(StatusEnum.FAILED);
-            status.setMessage(e.getMessage());
-        }
-    }
+        return deposits;

private void initOutputDir() {
@@ -133,4 +121,24 @@
throw new IllegalStateException("Failed to check directory: " + path, e);
}
}

private void processDeposits(TreeSet<DataverseIngestDeposit> deposits) {
for (DataverseIngestDeposit dataverseIngestDeposit : deposits) {
if (cancelled) {
log.info("Import job cancelled");
status.setMessage("Import job cancelled");
status.setStatus(StatusEnum.DONE);
return;
}
else {
log.info("START Processing deposit: {}", dataverseIngestDeposit.getId());
var task = depositTaskFactory.createDepositTask(dataverseIngestDeposit, outputDir, onlyConvertDansDeposit);
task.run();
log.info("END Processing deposit: {}", dataverseIngestDeposit.getId());
// TODO: record number of processed/rejected/failed deposits in ImportJob status
}
}
status.setMessage("Import job completed");
status.setStatus(StatusEnum.DONE);
}
}