
New API option to get the download size of the dataset version files without the original tabular files size #9960

Merged
merged 15 commits into 9852-files-api-deaccessioned-datasets from 9958-dataset-files-size on Oct 12, 2023

Conversation

@GPortas (Contributor) commented Sep 26, 2023

What this PR does / why we need it:

Added a new optional query parameter "mode" to the "getDownloadSize" API endpoint ("api/datasets/{identifier}/versions/{versionId}/downloadsize").

This parameter applies a filter criterion to the operation and supports the following values:

  • All (Default): Includes both archival and original sizes for tabular files

  • Archival: Includes only the archival size for tabular files

  • Original: Includes only the original size for tabular files

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Upload tabular files to a dataset and test the endpoint via curl.

Ignoring the tabular original sizes:

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Archival"

Including the tabular original and archival sizes (current behavior):

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize"

The above output should be the same as for:

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=All"

Counting only the original sizes (ignoring the archival sizes for tabular files):

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Original"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Yes

Additional documentation:

No

if (dataset == null) {
// should never happen - must indicate some data corruption in the database
throw new CommandException(BundleUtil.getStringFromBundle("datasets.api.listing.error"), this);
}

logger.fine("getDataverseStorageSize called on " + dataset.getDisplayName());
@GPortas (Contributor, Author) commented:
I have changed this order to avoid an NPE when logging.


@GPortas GPortas marked this pull request as ready for review September 27, 2023 09:54

@GPortas GPortas added the Size: 3 (a percentage of a sprint; 2.1 hours) label Sep 27, 2023
@Context HttpHeaders headers) {
return response(req -> {
DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
Long datasetStorageSize = ignoreOriginalTabularSize ? DatasetUtil.getDownloadSizeNumeric(datasetVersion, false)
@landreev (Contributor) commented Sep 27, 2023:
Hmm. Backward compatibility is important... although I honestly have no idea how popular this api is (nobody seems to be using it on our instance here), and if it was originally implemented that way (to count both types) because somebody specifically requested it/had a reason to count both, or if it was just implemented like that unintentionally.

I'm ok with leaving its default behavior backward compatible...
However, I'm assuming that the new UI will work the same way as the current page - with the user having a choice between downloading either of the 2 types, but never both at the same time, correct? In this case we need this api to also be able to only count the original sizes. So, maybe instead of the ignoreOriginalTabularSize boolean, something like mode=(all|original|archival), with all being the default?

@landreev landreev self-assigned this Sep 27, 2023
@landreev (Contributor) commented:

Aside from my earlier comment, my bigger concern, at least in the longer term, is performance.
Both the command and the DatasetUtil method rely on version.getFileMetadatas(). The way EJB normally handles this underneath, this is a very expensive operation for a dataset with many thousands of files. When this is done in the context of a page bean, at least we can initialize the version and all its filemetadatas and datafiles once, and then iterate through them as needed. But in the context of an API call, it will be initializing all that stuff from scratch, just to produce this size.

I feel like the best solution would be to use custom database queries instead. I'm not insisting that this needs to be done in this PR, but we need to somehow document this as a todo that needs to be addressed. I'm assuming it will need to be addressed for the MVP; we do have monstrous datasets in our production instance, with 25K and 40K+ files.
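A hedged sketch of what such a custom query could look like for the archival sizes (JPQL; assumes DataFile exposes a filesize field, and that em and versionId are in scope):

// Sum the archival sizes in the database instead of iterating over
// version.getFileMetadatas() and initializing every entity
Long archivalSize = em.createQuery(
        "SELECT COALESCE(SUM(fm.dataFile.filesize), 0) FROM FileMetadata fm"
      + " WHERE fm.datasetVersion.id = :versionId", Long.class)
    .setParameter("versionId", versionId)
    .getSingleResult();

This pushes the whole computation into a single SELECT, so the cost no longer grows with the number of entities EJB would otherwise have to materialize.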

@GPortas GPortas self-assigned this Sep 28, 2023
@GPortas (Contributor, Author) commented Sep 28, 2023

@landreev

Thank you for your detailed comments.

Maybe I have been too strict about backward compatibility, which is not so important on this endpoint.

I agree with your point about performance; it's something I hadn't considered at all, and I appreciate you raising it. I like the approach of using database queries plus mode=(all|original|archival), and I would like to at least evaluate that implementation for this PR. I will let you know.

Thank you!

@landreev (Contributor) commented:
From some quick archaeological digging, that "downloadsize" api was indeed added by request from an outside project that wanted to use it in a custom UI app they were building to work with Dataverse; and it loosely matched their use case (they wanted to download everything, original and archival copies of the tabular files). It was obviously never anticipated that we'd be building our own SPA UI that would want to use this api. I don't know if that project is still around, and if that backward compatibility is needed (but I can ask around). But, as I said, I'm still ok with preserving it, just in case.

@pdurbin (Member) commented Sep 28, 2023

added by request from an outside project

Yeah. The use case still makes sense at least:

"I'm an integrating system (Whole Tale, etc) and I want to provide some expectation in my UI about how long something will take when pulling files from Dataverse"

From #6524 (comment)

@GPortas GPortas added the SPA (these changes are required for the Dataverse SPA) label Sep 29, 2023
@landreev (Contributor) commented:
Just want to mention here that, instead of creating custom database queries for calculating these sizes, another alternative would be to try and utilize the EclipseLink left-join "hints", to avoid the follow-up database queries on the datafiles and datatables; so, instead of doing version.getFileMetadatas() etc., executing something along the lines of:

TypedQuery<FileMetadata> query = em.createQuery("SELECT fm FROM FileMetadata fm WHERE ...", FileMetadata.class)
   .setParameter("datasetVersionId", versionId)
   .setHint("eclipselink.left-join-fetch", "fm.dataFile")
   .setHint("eclipselink.left-join-fetch", "fm.dataFile.dataTables");

etc. And that would look up all the datafiles and datatables in one query, without the N additional queries... And then we can iterate through the list as before. (There's a chance that you don't need the .setHint for fm.dataFile there; the second one may be enough to look up both.)
In other words, the same approach as in DatasetVersionServiceBean.findDeep(), but only looking up the bare minimum of what we need for the filesize calculation.

The only way to find out which approach is better would be to test and time it on a dataset with 10Ks of files. A custom query would almost certainly be faster, but the difference may or may not be significant. The left-join approach above may feel a little cleaner, but I'm assuming you pay an extra cost in memory.

@landreev (Contributor) commented:
I may need to take back the "bare minimum" part, though: if we ever use that left-join hint approach, we have to add hints for any other 1:1 relationships there, or else we are back to the EJB executing N additional queries for them...
Regardless, that was not really a practical suggestion for this specific use case; I just wanted to document this alternative for other optimization situations.

@landreev (Contributor) commented Oct 2, 2023

FWIW - I have not analyzed the generated queries too much or benchmarked them, but you can see that JOINs are not used.

[edited:] Interesting, I've never used QueryDSL myself, so I was thinking of writing these queries by hand. It definitely looks cleaner; and it's kind of neat that you can write .where(fileMetadata.dataFile.dataTables.isEmpty()), and it just translates into
AND ((SELECT COUNT(t3.ID) FROM DATATABLE t3 WHERE (t3.DATAFILE_ID = t0.ID)) = 0)

In case you saw the first version of this comment, I was a little confused trying to read the screenshot above; I ended up building the branch and looking at the Postgres log, and playing with QueryDSL some more.

A minor thing, but is the .where(fileMetadata.dataFile.dataTables.isNotEmpty()) really necessary in the getOriginalTabularFilesSize() method? Shouldn't .where(dataTable.dataFile.eq(fileMetadata.dataFile)) be enough?
(I'm ready to approve the PR, looks good!)

uploadTabularFileResponse.then().assertThat().statusCode(OK.getStatusCode());

// Get the original tabular file size
int tabularOriginalSize = Integer.parseInt(uploadTabularFileResponse.getBody().jsonPath().getString("data.files[0].dataFile.filesize"));
@landreev (Contributor) commented:
Minor thing, but why not just = 157? (the known size of src/test/resources/tab/test.tab)
It's a tiny file that should get ingested almost instantly, so there may be a non-zero chance of the size in the response above already being the size of the generated archival file (I'm not sure... but it's safest to use the known size, I think).

@GPortas (Contributor, Author) replied:
I agree. If it makes the test safer, it's worth updating.
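A sketch of what pinning the known size could look like in the integration test (the UtilIT helper signature and the JSON path are assumptions based on the test style in this PR):

// Assert against the known fixture size rather than the upload response
int expectedTabularOriginalSize = 157; // known size of src/test/resources/tab/test.tab

Response downloadSizeResponse = UtilIT.getDownloadSize(datasetId, ":latest", "Original", apiToken);
downloadSizeResponse.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body("data.storageSize", equalTo(expectedTabularOriginalSize));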

@landreev (Contributor) commented Oct 2, 2023

Sorry, my bad - if you have read it already, please ignore my last comment about the query - I missed that AND (COUNT(...) = ...) clause in it. Will edit.

.from(fileMetadata)
.where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
.from(dataTable)
.where(fileMetadata.dataFile.dataTables.isNotEmpty().and(dataTable.dataFile.eq(fileMetadata.dataFile)))
@landreev (Contributor) commented:
Yes, .where(dataTable.dataFile.eq(fileMetadata.dataFile))
should be sufficient - and it simplifies the resulting query a bit.
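For reference, a sketch of the simplified query (assumes QueryDSL Q-classes QFileMetadata.fileMetadata and QDataTable.dataTable, and that DataTable stores the original size in an originalFileSize field):

// Joining dataTable against fileMetadata.dataFile already restricts the
// result set to tabular files, making the isNotEmpty() predicate redundant
JPAQueryFactory queryFactory = new JPAQueryFactory(em);
Long sum = queryFactory
        .from(fileMetadata)
        .where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
        .from(dataTable)
        .where(dataTable.dataFile.eq(fileMetadata.dataFile))
        .select(dataTable.originalFileSize.sum())
        .fetchOne();
long originalTabularFilesSize = (sum == null) ? 0 : sum;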

@GPortas GPortas requested a review from landreev October 3, 2023 08:56
@github-actions bot commented Oct 3, 2023:
📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9958-dataset-files-size
ghcr.io/gdcc/configbaker:9958-dataset-files-size

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev (Contributor) left a review:
Perfect, thank you!

@kcondon kcondon assigned kcondon and unassigned landreev Oct 4, 2023
return error(Response.Status.BAD_REQUEST, "Invalid mode: " + mode);
}
DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
long datasetStorageSize = datasetVersionFilesServiceBean.getFilesDownloadSize(datasetVersion, fileDownloadSizeMode);
@landreev (Contributor) commented:
Hey, I'm wondering if I gave you bad advice to move this size calculation out of the command. The command provided one extra function: the authorization (requiring the ViewUnpublishedDataset permission on draft versions), which we no longer appear to be enforcing in the above (?).
I can't imagine this information, this download size, being truly sensitive... but we apparently used to require that permission, so at the very least we need to confirm whether it is necessary or not.
Sorry for the extra confusion.

@landreev (Contributor) commented:
(I'll ask on slack)

@landreev (Contributor) commented:
Uh, never mind? That authorization check is still being performed, inside the getDatasetVersionOrDie().

@kcondon (Contributor) commented Oct 11, 2023

@GPortas I am seeing the same download size for all variations of the endpoint.

@GPortas (Contributor, Author) commented Oct 11, 2023

@GPortas I am seeing the same download size for all variations of the endpoint.

@kcondon I'm sorry! I forgot to update the PR description with the testing instructions for the latest changes.

@kcondon kcondon merged commit 3d3a553 into 9852-files-api-deaccessioned-datasets Oct 12, 2023
11 of 12 checks passed
@kcondon kcondon deleted the 9958-dataset-files-size branch October 12, 2023 14:05
@pdurbin pdurbin added this to the 6.1 milestone Oct 13, 2023