API for auditing physical files and file metadata #11016

Merged
25 commits, merged Dec 2, 2024

Commits
60d6f92
audit physical files
stevenwinship Nov 13, 2024
804d284
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
a62193c
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
d0df4f0
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
a1d1030
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
e433ee2
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
e4751c5
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
456f9f6
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
9b15681
Update src/main/java/edu/harvard/iq/dataverse/api/Admin.java
stevenwinship Nov 19, 2024
2586c33
fix camelcase for datasetIdentifierList
stevenwinship Nov 19, 2024
abfc738
fix camelcase for datasetIdentifierList
stevenwinship Nov 19, 2024
b64addc
reformat json output
stevenwinship Nov 19, 2024
e89f1ca
reformat json output
stevenwinship Nov 19, 2024
7e9aae9
reformat json output
stevenwinship Nov 19, 2024
11cbe85
reformat json output
stevenwinship Nov 19, 2024
3eec366
adding directory label to json and changing camelCase
stevenwinship Nov 19, 2024
26e8574
tabs to spaces
stevenwinship Nov 20, 2024
2db26b2
add pid
stevenwinship Nov 20, 2024
2c5aca8
fix typos
stevenwinship Nov 20, 2024
3c67a79
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 20, 2024
58d3235
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 20, 2024
50b752a
fix typos
stevenwinship Nov 20, 2024
a192c17
fix release note
stevenwinship Dec 2, 2024
e06e1d2
fix api doc
stevenwinship Dec 2, 2024
8c79f67
fix api doc
stevenwinship Dec 2, 2024
16 changes: 16 additions & 0 deletions doc/release-notes/220-harvard-edu-audit-files.md
@@ -0,0 +1,16 @@
### New API to Audit Datafiles across the database

This is a superuser-only tool to audit Datasets with DataFiles where the physical files are missing or the file metadata is missing.
The Datasets scanned can be limited by the optional firstId and lastId query parameters, or by a given CSV list of Dataset identifiers.
Once the audit report is generated, an Administrator can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).

The JSON response includes:
- A list of DataFiles where the file exists in the database but the physical file is missing from the file store.
- A list of DataFiles where the FileMetadata is missing.
- Other failures found when trying to process the Datasets.

curl "http://localhost:8080/api/admin/datafiles/auditFiles"
curl "http://localhost:8080/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
curl "http://localhost:8080/api/admin/datafiles/auditFiles?DatasetIdentifierList=doi:10.5072/FK2/RVNT9Q,doi:10.5072/FK2/RVNT9Q"

For more information, see issue [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220)
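The three curl invocations above differ only in their query string. As a minimal sketch, the assembly of that query string from the optional parameters can be expressed as follows (the helper name is hypothetical, not part of the API):

```python
def build_audit_url(base, first_id=None, last_id=None, dataset_identifier_list=None):
    """Assemble the auditFiles URL from the optional query parameters."""
    params = []
    if first_id is not None:
        params.append(f"firstId={first_id}")
    if last_id is not None:
        params.append(f"lastId={last_id}")
    if dataset_identifier_list is not None:
        # the API expects a comma-separated list of PIDs
        params.append("DatasetIdentifierList=" + ",".join(dataset_identifier_list))
    return base + ("?" + "&".join(params) if params else "")

base = "http://localhost:8080/api/admin/datafiles/auditFiles"
print(build_audit_url(base))
print(build_audit_url(base, first_id=0, last_id=1000))
print(build_audit_url(base, dataset_identifier_list=["doi:10.5072/FK2/RVNT9Q"]))
```

Omitting all parameters audits every local Dataset, so on a large installation the paged form is the safer starting point.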
55 changes: 55 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -6200,6 +6200,61 @@ Note that if you are attempting to validate a very large number of datasets in y

asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600

Datafile Audit
~~~~~~~~~~~~~~

Produce an audit report of missing files and FileMetadata for Datasets.
Scans the Datasets in the database and verifies that the stored files exist. If the files are missing or the FileMetadata is missing, this information is returned in a JSON response::

curl "$SERVER_URL/api/admin/datafiles/auditFiles"

Optional parameters are available for filtering the Datasets scanned.

For auditing the Datasets in a paged manner (firstId and lastId)::

curl "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"

Auditing specific Datasets (comma separated list)::

  curl "$SERVER_URL/api/admin/datafiles/auditFiles?DatasetIdentifierList=doi.org/10.5072/FK2/JXYBJS,doi.org/10.7910/DVN/MPU019"
Member: Do we use this pattern of passing in the URL form of a PID minus "https://" anywhere else? It seems ok. Can we pass in the normal PIDs (the non-URL form) instead?

Member: No, it's different... "doi.org/10..." vs. "doi:10...". In this PR we should use the pattern from reExportDataset.

Contributor Author (stevenwinship): updated the doc


Sample JSON audit response::

{
"status": "OK",
"data": {
"firstId": 0,
"lastId": 100,
"DatasetIdentifierList": [
"doi.org/10.5072/FK2/XXXXXX",
Member: Same here. Just the PID would be better than the URL form without "https://".

Contributor Author (stevenwinship, Nov 19, 2024): This value is eventually passed to PidUtil.parseAsGlobalID() so it will work with any format that works with that method. So whatever you type into that list will show up, as is, in the JSON "DatasetIdentifierList": []. It's just there to document what you passed in. Even if it's garbage.

Member: 🗑️ in 🗑️ out, as they say!

"doi.org/10.5072/FK2/JXYBJS",
"doi.org/10.7910/DVN/MPU019"
],
"datasetsChecked": 100,
"datasets": [
{
"id": 6,
"identifier": "FK2/JXYBJS",
Member: This might technically be the identifier from the database but what about other types of PIDs like Handles and Permalinks? Let's not make users parse "persistentURL". We could expose this information separately.

Contributor Author (stevenwinship): Added authority and protocol. Was this what you wanted or is there more?

Member: I'm sorry to be a pain but I think I'd rather have the full PID (e.g. "doi:10.5072/FK2/JXYBJS") than each field split out. I mean, if you want to leave protocol, authority, and identifier fields in, I won't object, but having the full PID is useful, I think. The full PID is what we operate on in a lot of cases, including this "audit" API we're adding.

"persistentURL": "https://doi.org/10.5072/FK2/JXYBJS",
"missingFileMetadata": [
"local://1930cce4f2d-855ccc51fcbb, DataFile Id:7"
Member: For easier parsing, should this be a JSON object with entries? Instead of a string where you split on comma then colon.
]
},
{
"id": 47731,
"identifier": "DVN/MPU019",
"persistentURL": "https://doi.org/10.7910/DVN/MPU019",
"missingFiles": [
"s3://dvn-cloud:298910, jihad_metadata_edited.csv"
Member: Same. Easier parsing would be nice.

Contributor Author (stevenwinship): re-formatted the JSON output:

    "missingFiles": [
      {
        "StorageIdentifier": "s3://dvn-cloud:298910",
        "label": "jihad_metadata_edited.csv"
      }
    ]

Member: Great! Thanks. Do we need the directoryLabel too?

Contributor Author (stevenwinship): added directoryLabel

]
}
],
"failures": [
"DatasetIdentifier Not Found: doi.org/10.5072/FK2/XXXXXX"
]
}
}
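The firstId/lastId fields in the sample above reflect the server-side defaulting of the parameters. As a sketch (with hypothetical names, mirroring the order of operations in the endpoint: a supplied identifier list resets the id range before the range is validated):

```python
def resolve_audit_range(first_id=None, last_id=None, dataset_identifier_list=None):
    """Sketch of the audit endpoint's parameter defaulting and validation."""
    MAX_ID = 2**63 - 1  # stands in for Java's Long.MAX_VALUE
    start = 0 if first_id is None else first_id
    end = MAX_ID if last_id is None else last_id
    identifiers = []
    if dataset_identifier_list:
        # an explicit identifier list overrides any id range
        start, end = 0, MAX_ID
        identifiers = [s.strip() for s in dataset_identifier_list.split(",")]
    if end < start:
        raise ValueError("lastId must be equal to or greater than firstId")
    return start, end, identifiers

print(resolve_audit_range(first_id=0, last_id=100))
```

Note that firstId and lastId only appear in the response when they narrow the range, which is why the sample shows both.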

Workflows
~~~~~~~~~

161 changes: 136 additions & 25 deletions src/main/java/edu/harvard/iq/dataverse/api/Admin.java
@@ -1,28 +1,11 @@
package edu.harvard.iq.dataverse.api;

import edu.harvard.iq.dataverse.BannerMessage;
import edu.harvard.iq.dataverse.BannerMessageServiceBean;
import edu.harvard.iq.dataverse.BannerMessageText;
import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.DataFileServiceBean;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetServiceBean;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DatasetVersionServiceBean;
import edu.harvard.iq.dataverse.Dataverse;
import edu.harvard.iq.dataverse.DataverseRequestServiceBean;
import edu.harvard.iq.dataverse.DataverseServiceBean;
import edu.harvard.iq.dataverse.DataverseSession;
import edu.harvard.iq.dataverse.DvObject;
import edu.harvard.iq.dataverse.DvObjectServiceBean;
import edu.harvard.iq.dataverse.*;
import edu.harvard.iq.dataverse.api.auth.AuthRequired;
import edu.harvard.iq.dataverse.settings.JvmSettings;
import edu.harvard.iq.dataverse.util.StringUtil;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;
import edu.harvard.iq.dataverse.validation.EMailValidator;
import edu.harvard.iq.dataverse.EjbDataverseEngine;
import edu.harvard.iq.dataverse.Template;
import edu.harvard.iq.dataverse.TemplateServiceBean;
import edu.harvard.iq.dataverse.UserServiceBean;
import edu.harvard.iq.dataverse.actionlogging.ActionLogRecord;
import edu.harvard.iq.dataverse.api.dto.RoleDTO;
import edu.harvard.iq.dataverse.authorization.AuthenticatedUserDisplayInfo;
@@ -66,8 +49,9 @@
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.*;
import java.util.Map.Entry;
import java.util.function.Predicate;
import java.util.logging.Level;
import java.util.logging.Logger;
import jakarta.ejb.EJB;
@@ -81,7 +65,6 @@

import org.apache.commons.io.IOUtils;

import java.util.List;
import edu.harvard.iq.dataverse.authorization.AuthTestDataServiceBean;
import edu.harvard.iq.dataverse.authorization.AuthenticationProvidersRegistrationServiceBean;
import edu.harvard.iq.dataverse.authorization.DataverseRole;
@@ -118,17 +101,14 @@
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.json;
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.rolesToJson;
import static edu.harvard.iq.dataverse.util.json.JsonPrinter.toJsonArray;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;

import jakarta.inject.Inject;
import jakarta.json.JsonArray;
import jakarta.persistence.Query;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.WebApplicationException;
import jakarta.ws.rs.core.StreamingOutput;
import java.nio.file.Paths;
import java.util.TreeMap;

/**
* Where the secure, setup API calls live.
@@ -2541,4 +2521,135 @@ public Response getFeatureFlag(@PathParam("flag") String flagIn) {
}
}

@GET
@AuthRequired
@Path("/datafiles/auditFiles")
public Response getAuditFiles(@Context ContainerRequestContext crc,
@QueryParam("firstId") Long firstId, @QueryParam("lastId") Long lastId,
@QueryParam("DatasetIdentifierList") String DatasetIdentifierList) throws WrappedResponse {
try {
AuthenticatedUser user = getRequestAuthenticatedUserOrDie(crc);
if (!user.isSuperuser()) {
return error(Response.Status.FORBIDDEN, "Superusers only.");
}
} catch (WrappedResponse wr) {
return wr.getResponse();
}

List<String> failures = new ArrayList<>();
int datasetsChecked = 0;
long startId = (firstId == null ? 0 : firstId);
long endId = (lastId == null ? Long.MAX_VALUE : lastId);

List<String> datasetIdentifiers;
if (DatasetIdentifierList == null || DatasetIdentifierList.isEmpty()) {
datasetIdentifiers = Collections.emptyList();
} else {
startId = 0;
endId = Long.MAX_VALUE;
datasetIdentifiers = List.of(DatasetIdentifierList.split(","));
}
if (endId < startId) {
return badRequest("Invalid Parameters: lastId must be equal to or greater than firstId");
}

NullSafeJsonBuilder jsonObjectBuilder = NullSafeJsonBuilder.jsonObjectBuilder();
if (startId > 0) {
jsonObjectBuilder.add("firstId", startId);
}
if (endId < Long.MAX_VALUE) {
jsonObjectBuilder.add("lastId", endId);
}

// compile the list of ids to process
List<Long> datasetIds;
if (datasetIdentifiers.isEmpty()) {
datasetIds = datasetService.findAllLocalDatasetIds();
} else {
datasetIds = new ArrayList<>(datasetIdentifiers.size());
JsonArrayBuilder jab = Json.createArrayBuilder();
datasetIdentifiers.forEach(id -> {
String dId = id.trim();
jab.add(dId);
Dataset d = datasetService.findByGlobalId(dId);
if (d != null) {
datasetIds.add(d.getId());
} else {
failures.add("DatasetIdentifier Not Found: " + dId);
}
});
jsonObjectBuilder.add("DatasetIdentifierList", jab);
}

JsonArrayBuilder jsonDatasetsArrayBuilder = Json.createArrayBuilder();
for (Long datasetId : datasetIds) {
if (datasetId < startId) {
continue;
} else if (datasetId > endId) {
break;
}
Dataset dataset;
try {
dataset = findDatasetOrDie(String.valueOf(datasetId));
datasetsChecked++;
} catch (WrappedResponse ex) {
failures.add("DatasetId:" + datasetId + " Reason:" + ex.getMessage());
continue;
}

List<String> missingFiles = new ArrayList<>();
List<String> missingFileMetadata = new ArrayList<>();
try {
Predicate<String> filter = s -> true;
StorageIO<DvObject> datasetIO = DataAccess.getStorageIO(dataset);
final List<String> result = datasetIO.cleanUp(filter, true);
// add files that are in dataset files but not in cleanup result or DataFiles with missing FileMetadata
dataset.getFiles().forEach(df -> {
try {
StorageIO<DataFile> datafileIO = df.getStorageIO();
String storageId = df.getStorageIdentifier();
FileMetadata fm = df.getFileMetadata();
if (!datafileIO.exists()) {
missingFiles.add(storageId + ", " + (fm != null ? fm.getLabel() : df.getContentType()));
}
if (fm == null) {
missingFileMetadata.add(storageId + ", DataFile Id:" + df.getId());
}
} catch (IOException e) {
failures.add("DataFileId:" + df.getId() + ", " + e.getMessage());
}
});
} catch (IOException e) {
failures.add("DatasetId:" + datasetId + ", " + e.getMessage());
}

JsonObjectBuilder job = Json.createObjectBuilder();
if (!missingFiles.isEmpty() || !missingFileMetadata.isEmpty()) {
job.add("id", dataset.getId());
job.add("identifier", dataset.getIdentifier());
job.add("persistentURL", dataset.getPersistentURL());
if (!missingFileMetadata.isEmpty()) {
JsonArrayBuilder jabMissingFileMetadata = Json.createArrayBuilder();
missingFileMetadata.forEach(jabMissingFileMetadata::add);
job.add("missingFileMetadata", jabMissingFileMetadata);
}
if (!missingFiles.isEmpty()) {
JsonArrayBuilder jabMissingFiles = Json.createArrayBuilder();
missingFiles.forEach(jabMissingFiles::add);
job.add("missingFiles", jabMissingFiles);
}
jsonDatasetsArrayBuilder.add(job);
}
}

jsonObjectBuilder.add("datasetsChecked", datasetsChecked);
jsonObjectBuilder.add("datasets", jsonDatasetsArrayBuilder);
if (!failures.isEmpty()) {
JsonArrayBuilder jsonFailuresArrayBuilder = Json.createArrayBuilder();
failures.forEach(jsonFailuresArrayBuilder::add);
jsonObjectBuilder.add("failures", jsonFailuresArrayBuilder);
}

return ok(jsonObjectBuilder);
}
}
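The scan loop in getAuditFiles above skips ids below the range with continue and stops at the first id above it with break, which is only safe if the id list is in ascending order. A minimal sketch of that filtering (hypothetical names; the ordering assumption is made explicit):

```python
def ids_in_range(dataset_ids, start_id, end_id):
    """Select dataset ids in [start_id, end_id] from an ascending list.

    Mirrors the continue/break logic of the audit loop: because the input
    is sorted, iteration can stop at the first id past end_id.
    """
    selected = []
    for dataset_id in dataset_ids:
        if dataset_id < start_id:
            continue
        if dataset_id > end_id:
            break  # safe only because ids are ascending
        selected.append(dataset_id)
    return selected

print(ids_in_range([1, 5, 42, 99, 1000, 5000], 5, 1000))
```

With an unsorted list the break would silently drop in-range ids, so the correctness of the paging depends on findAllLocalDatasetIds() returning ids in order.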
src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java
@@ -753,6 +753,12 @@ public Path getFileSystemPath() throws UnsupportedDataAccessOperationException {

@Override
public boolean exists() {
try {
key = getMainFileKey();
} catch (IOException e) {
logger.warning("Caught an IOException in S3AccessIO.exists(): " + e.getMessage());
return false;
}
String destinationKey = null;
if (dvObject instanceof DataFile) {
destinationKey = key;
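The hardening of exists() above treats a failure to resolve the main file key as "file does not exist" instead of letting the exception escape, so a single unreadable object cannot abort a whole audit run. The pattern, sketched in Python with hypothetical callables standing in for the S3 calls:

```python
def object_exists(get_main_file_key, head_object):
    """Sketch of the hardened exists() check: a failure while resolving
    the storage key is logged and reported as 'missing', not raised."""
    try:
        key = get_main_file_key()
    except IOError as e:
        print(f"Caught an IOException in exists(): {e}")
        return False
    return head_object(key)

print(object_exists(lambda: "datasets/1/file.bin", lambda key: True))
```

The trade-off is that a transient I/O error is indistinguishable from a genuinely missing file in the report, which is acceptable for an advisory audit.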
49 changes: 46 additions & 3 deletions src/test/java/edu/harvard/iq/dataverse/api/AdminIT.java
@@ -16,6 +16,8 @@
import java.util.HashMap;
import java.util.List;

import jakarta.json.Json;
import jakarta.json.JsonArray;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.BeforeAll;
@@ -26,13 +28,11 @@

import java.util.Map;
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

import static jakarta.ws.rs.core.Response.Status.*;
import static org.hamcrest.CoreMatchers.*;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.CoreMatchers.notNullValue;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class AdminIT {
@@ -901,6 +901,49 @@ public void testDownloadTmpFile() throws IOException {
.body("message", equalTo("Path must begin with '/tmp' but after normalization was '/etc/passwd'."));
}

@Test
public void testFindMissingFiles() {
Response createUserResponse = UtilIT.createRandomUser();
createUserResponse.then().assertThat().statusCode(OK.getStatusCode());
String username = UtilIT.getUsernameFromResponse(createUserResponse);
String apiToken = UtilIT.getApiTokenFromResponse(createUserResponse);
UtilIT.setSuperuserStatus(username, true);

String dataverseAlias = ":root";
Response createDatasetResponse = UtilIT.createRandomDatasetViaNativeApi(dataverseAlias, apiToken);
createDatasetResponse.prettyPrint();
createDatasetResponse.then().assertThat().statusCode(CREATED.getStatusCode());
int datasetId = JsonPath.from(createDatasetResponse.body().asString()).getInt("data.id");
String datasetPersistentId = JsonPath.from(createDatasetResponse.body().asString()).getString("data.persistentId");

// Upload file
Response uploadResponse = UtilIT.uploadRandomFile(datasetPersistentId, apiToken);
uploadResponse.then().assertThat().statusCode(CREATED.getStatusCode());

// Audit files
Response resp = UtilIT.auditFiles(apiToken, null, 100L, null);
resp.prettyPrint();
JsonArray emptyArray = Json.createArrayBuilder().build();
resp.then().assertThat()
.statusCode(OK.getStatusCode())
.body("data.lastId", equalTo(100));

// Audit files with invalid parameters
resp = UtilIT.auditFiles(apiToken, 100L, 0L, null);
resp.prettyPrint();
resp.then().assertThat()
.statusCode(BAD_REQUEST.getStatusCode())
.body("status", equalTo("ERROR"))
.body("message", equalTo("Invalid Parameters: lastId must be equal to or greater than firstId"));

// Audit files with list of dataset identifiers parameter
resp = UtilIT.auditFiles(apiToken, 1L, null, "bad/id, " + datasetPersistentId);
resp.prettyPrint();
resp.then().assertThat()
.statusCode(OK.getStatusCode())
.body("data.failures[0]", equalTo("DatasetIdentifier Not Found: bad/id"));
}

private String createTestNonSuperuserApiToken() {
Response createUserResponse = UtilIT.createRandomUser();
createUserResponse.then().assertThat().statusCode(OK.getStatusCode());
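The report entry the test asserts against ("DatasetIdentifier Not Found: bad/id") is one of the plain-string formats getAuditFiles builds; the missingFiles and missingFileMetadata entries in the documented sample response follow the same style. A small sketch of those formats (helper names are hypothetical):

```python
def missing_file_entry(storage_id, label):
    # physical file absent: "<storageIdentifier>, <label>"
    return f"{storage_id}, {label}"

def missing_metadata_entry(storage_id, datafile_id):
    # FileMetadata absent: "<storageIdentifier>, DataFile Id:<id>"
    return f"{storage_id}, DataFile Id:{datafile_id}"

print(missing_file_entry("s3://dvn-cloud:298910", "jihad_metadata_edited.csv"))
print(missing_metadata_entry("local://1930cce4f2d-855ccc51fcbb", 7))
```

As the review thread notes, consumers must split these strings themselves, which is why the discussion moved toward structured JSON objects for the per-file entries.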
16 changes: 16 additions & 0 deletions src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@@ -241,6 +241,22 @@ public static Response clearThumbnailFailureFlag(long fileId) {
return response;
}

public static Response auditFiles(String apiToken, Long firstId, Long lastId, String csvList) {
String params = "";
if (firstId != null) {
params = "?firstId="+ firstId;
}
if (lastId != null) {
params = params + (params.isEmpty() ? "?" : "&") + "lastId="+ lastId;
}
if (csvList != null) {
params = params + (params.isEmpty() ? "?" : "&") + "DatasetIdentifierList="+ csvList;
}
return given()
.header(API_TOKEN_HTTP_HEADER, apiToken)
.get("/api/admin/datafiles/auditFiles" + params);
}

private static String getAuthenticatedUserAsJsonString(String persistentUserId, String firstName, String lastName, String authenticationProviderId, String identifier) {
JsonObjectBuilder builder = Json.createObjectBuilder();
builder.add("authenticationProviderId", authenticationProviderId);