API for auditing physical files and file metadata #11016

Merged
merged 25 commits into from
Dec 2, 2024
60d6f92
audit physical files
stevenwinship Nov 13, 2024
804d284
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
a62193c
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
d0df4f0
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
a1d1030
Update doc/sphinx-guides/source/api/native-api.rst
stevenwinship Nov 19, 2024
e433ee2
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
e4751c5
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
456f9f6
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 19, 2024
9b15681
Update src/main/java/edu/harvard/iq/dataverse/api/Admin.java
stevenwinship Nov 19, 2024
2586c33
fix camelcase for datasetIdentifierList
stevenwinship Nov 19, 2024
abfc738
fix camelcase for datasetIdentifierList
stevenwinship Nov 19, 2024
b64addc
reformat json output
stevenwinship Nov 19, 2024
e89f1ca
reformat json output
stevenwinship Nov 19, 2024
7e9aae9
reformat json output
stevenwinship Nov 19, 2024
11cbe85
reformat json output
stevenwinship Nov 19, 2024
3eec366
adding directory label to json and changing camelCase
stevenwinship Nov 19, 2024
26e8574
tabs to spaces
stevenwinship Nov 20, 2024
2db26b2
add pid
stevenwinship Nov 20, 2024
2c5aca8
fix typos
stevenwinship Nov 20, 2024
3c67a79
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 20, 2024
58d3235
Update doc/release-notes/220-harvard-edu-audit-files.md
stevenwinship Nov 20, 2024
50b752a
fix typos
stevenwinship Nov 20, 2024
a192c17
fix release note
stevenwinship Dec 2, 2024
e06e1d2
fix api doc
stevenwinship Dec 2, 2024
8c79f67
fix api doc
stevenwinship Dec 2, 2024
16 changes: 16 additions & 0 deletions doc/release-notes/220-harvard-edu-audit-files.md
@@ -0,0 +1,16 @@
### New API to Audit Datafiles across the database

This is a superuser-only API endpoint that audits Datasets for DataFiles whose physical files or file metadata are missing.
The Datasets scanned can be limited by optional firstId and lastId query parameters, or a given CSV list of Dataset Identifiers.
Once the audit report is generated, a superuser can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).

The JSON response includes:
- List of files in each Dataset where the file exists in the database but the physical file is not in the file store.
- List of DataFiles where the FileMetadata is missing.
- Other failures found when trying to process the Datasets.

curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/RVNT9Q,doi:10.5072/FK2/RVNT9Q"

For more information, see [the docs](https://dataverse-guide--11016.org.readthedocs.build/en/11016/api/native-api.html#datafile-audit), #11016, and [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220)
66 changes: 66 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -6200,6 +6200,72 @@ Note that if you are attempting to validate a very large number of datasets in y

  asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600

Datafile Audit
~~~~~~~~~~~~~~

Produces an audit report of missing files and missing FileMetadata for Datasets.
The endpoint scans the Datasets in the database and verifies that the stored files exist. Any missing files or missing FileMetadata are reported in a JSON response.
The call returns a status code of 200 if the report was generated successfully. Issues found are documented in the report; a failure status code is returned only if the report could not be generated::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles"

Optional parameters are available for filtering the Datasets scanned.

For auditing the Datasets in a paged manner (firstId and lastId)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
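
To audit a large installation in pages, a script can step through the id space in fixed windows and issue one request per window. The following is a minimal sketch of that idea, not code from this PR: `id_windows`, `audit_urls`, and the window size are hypothetical, and treating both `firstId` and `lastId` as inclusive bounds is an assumption that the documentation does not confirm. If either bound turns out to be exclusive, the arithmetic needs adjusting.

```python
def id_windows(max_id, page_size):
    """Yield (first_id, last_id) pairs covering 0..max_id without overlap.

    Assumes both bounds are inclusive (not confirmed by the API docs).
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    first = 0
    while first <= max_id:
        last = min(first + page_size - 1, max_id)
        yield first, last
        first = last + 1


def audit_urls(server_url, max_id, page_size):
    """Build one auditFiles URL per id window."""
    base = server_url.rstrip("/") + "/api/admin/datafiles/auditFiles"
    return [
        f"{base}?firstId={first}&lastId={last}"
        for first, last in id_windows(max_id, page_size)
    ]
```

Each URL can then be requested with the same `X-Dataverse-key` header as in the curl example, which keeps any single request short enough to stay under the server's request timeout.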

Auditing specific Datasets (comma separated list)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/JXYBJS,doi:10.7910/DVN/MPU019"
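
For scripted use, the same requests can be assembled in Python with the standard library. This is a sketch under stated assumptions: `build_audit_request` is a hypothetical helper, and only the endpoint path, the `X-Dataverse-key` header, and the three query parameter names are taken from the documentation. The function only constructs the request; sending it (for example with `urllib.request.urlopen`) requires a running installation and a superuser API token.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_audit_request(server_url, api_token, first_id=None, last_id=None,
                        dataset_identifiers=None):
    """Build (but do not send) a GET request for the auditFiles endpoint.

    Hypothetical helper: the parameter handling mirrors the curl examples.
    """
    params = {}
    if first_id is not None:
        params["firstId"] = first_id
    if last_id is not None:
        params["lastId"] = last_id
    if dataset_identifiers:
        # The API takes a comma separated list of dataset identifiers.
        params["datasetIdentifierList"] = ",".join(dataset_identifiers)

    url = server_url.rstrip("/") + "/api/admin/datafiles/auditFiles"
    if params:
        # Leave ':', '/' and ',' unescaped so the URL matches the curl examples.
        url += "?" + urlencode(params, safe=":/,")
    return Request(url, headers={"X-Dataverse-key": api_token})
```

Omitting all optional arguments yields the unfiltered request, which audits every Dataset in the database.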

Sample JSON Audit Response::

  {
    "status": "OK",
    "data": {
      "firstId": 0,
      "lastId": 100,
      "datasetIdentifierList": [
        "doi:10.5072/FK2/XXXXXX",
        "doi:10.5072/FK2/JXYBJS",
        "doi:10.7910/DVN/MPU019"
      ],
      "datasetsChecked": 100,
      "datasets": [
        {
          "id": 6,
          "pid": "doi:10.5072/FK2/JXYBJS",
          "persistentURL": "https://doi.org/10.5072/FK2/JXYBJS",
          "missingFileMetadata": [
            {
              "storageIdentifier": "local://1930cce4f2d-855ccc51fcbb",
              "dataFileId": "7"
            }
Comment on lines +6240 to +6244

Member: Maybe this is out of scope for this PR, but if there is missing file metadata, what do I do? Can I fix this via API? Or do I have to hack on the database?

Contributor Author: I think the owner has to delete the file and re-upload it.

Member: Yeah, that sounds right. Thanks.

          ]
        },
        {
          "id": 47731,
          "pid": "doi:10.7910/DVN/MPU019",
          "persistentURL": "https://doi.org/10.7910/DVN/MPU019",
          "missingFiles": [
            {
              "storageIdentifier": "s3://dvn-cloud:298910",
              "directoryLabel": "trees",
              "label": "trees.png"
            }
          ]
        }
      ],
      "failures": [
        {
          "datasetIdentifier": "doi:10.5072/FK2/XXXXXX",
          "reason": "Not Found"
        }
      ]
    }
  }
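
Because the endpoint returns 200 even when problems are found, a client has to inspect the report body to learn whether anything is missing. The following is a minimal sketch of such a check, assuming the response has the shape of the sample above. `summarize_audit` is a hypothetical helper, and the keys `missingFiles`, `missingFileMetadata`, and `failures` are treated as optional because each dataset entry in the sample carries only one of the two lists.

```python
import json


def summarize_audit(report_text):
    """Return counts of the problems listed in an auditFiles JSON response."""
    report = json.loads(report_text)
    if report.get("status") != "OK":
        raise ValueError(f"audit report not generated: {report.get('status')}")
    data = report.get("data", {})
    datasets = data.get("datasets", [])
    return {
        "datasetsChecked": data.get("datasetsChecked", 0),
        "datasetsListed": len(datasets),
        "missingFiles": sum(len(d.get("missingFiles", [])) for d in datasets),
        "missingFileMetadata": sum(
            len(d.get("missingFileMetadata", [])) for d in datasets
        ),
        "failures": len(data.get("failures", [])),
    }
```

A superuser could run this over each page of results and follow up only on the Datasets whose counts are non-zero, for instance by contacting the author to delete and re-upload the affected file.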

Workflows
~~~~~~~~~
