
New API option to get the download size of the dataset version files without the original tabular files size #9960

Merged
merged 15 commits into 9852-files-api-deaccessioned-datasets from 9958-dataset-files-size on Oct 12, 2023

Conversation

@GPortas (Contributor) commented Sep 26, 2023

What this PR does / why we need it:

Added a new optional query parameter "mode" to the "getDownloadSize" API endpoint ("api/datasets/{identifier}/versions/{versionId}/downloadsize").

This parameter applies a filter criterion to the operation and supports the following values:

  • All (Default): Includes both archival and original sizes for tabular files

  • Archival: Includes only the archival size for tabular files

  • Original: Includes only the original size for tabular files

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this:

Upload tabular files to a dataset and test the endpoint via curl.

Ignoring the tabular original sizes:

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Archival"

Including the tabular original and archival sizes (current behavior):

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize"

The above output should be the same as for:

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=All"

Counting only the original sizes (ignoring the archival sizes for tabular files):

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Original"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

Yes

Additional documentation:

No

if (dataset == null) {
// should never happen - must indicate some data corruption in the database
throw new CommandException(BundleUtil.getStringFromBundle("datasets.api.listing.error"), this);
}

logger.fine("getDataverseStorageSize called on " + dataset.getDisplayName());
@GPortas (Contributor, Author) commented:
I have changed this order to avoid an NPE when logging.


@GPortas GPortas marked this pull request as ready for review September 27, 2023 09:54

@GPortas GPortas added the Size: 3 (a percentage of a sprint; 2.1 hours) label Sep 27, 2023
@Context HttpHeaders headers) {
return response(req -> {
DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
Long datasetStorageSize = ignoreOriginalTabularSize ? DatasetUtil.getDownloadSizeNumeric(datasetVersion, false)
@landreev (Contributor) commented Sep 27, 2023:
Hmm. Backward compatibility is important... although I honestly have no idea how popular this api is (nobody seems to be using it on our instance here), and if it was originally implemented that way (to count both types) because somebody specifically requested it/had a reason to count both, or if it was just implemented like that unintentionally.

I'm ok with leaving its default behavior backward compatible...
However, I'm assuming that the new UI will work the same way as the current page - with the user having a choice between downloading either of the 2 types, but never both at the same time, correct? In this case we need this api to also be able to only count the original sizes. So, maybe instead of the ignoreOriginalTabularSize boolean, something like mode=(all|original|archival), with all being the default?

@landreev landreev self-assigned this Sep 27, 2023
@landreev (Contributor) commented:

Aside from my earlier comment, my bigger concern, at least in the longer term, is performance.
Both the command and the DatasetUtil method rely on version.getFileMetadatas(). The way EJB normally handles this underneath, this is a very expensive operation for a dataset with many thousands of files. When this is done in the context of a page bean, at least we can initialize the version and all its filemetadatas and datafiles once, and then iterate through them as needed. But in the context of an API call, it will be initializing all that stuff from scratch, just to produce this size.

I feel like the best solution would be to use custom database queries instead. I'm not insisting that this needs to be done in this PR, but we need to somehow document this as a todo that needs to be addressed. I'm assuming it will need to be addressed for the MVP; we do have monstrous datasets in our production instance, with 25K and 40K+ files.
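A hedged sketch of what such a custom query could look like for the archival sizes (JPQL; assumes DataFile exposes a filesize field, and that em and versionId are in scope):

// Sum the archival sizes in the database instead of iterating over
// version.getFileMetadatas() and initializing every entity
Long archivalSize = em.createQuery(
        "SELECT COALESCE(SUM(fm.dataFile.filesize), 0) FROM FileMetadata fm"
      + " WHERE fm.datasetVersion.id = :versionId", Long.class)
    .setParameter("versionId", versionId)
    .getSingleResult();

This pushes the whole computation into a single SELECT, so the cost no longer grows with the number of entities EJB would otherwise have to materialize.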

@GPortas GPortas self-assigned this Sep 28, 2023
@GPortas (Contributor, Author) commented Sep 28, 2023

@landreev

Thank you for your detailed comments.

Maybe I have been too strict about backward compatibility, which is not so important on this endpoint.

I agree with your point about performance; it's something I hadn't considered at all, and I appreciate you raising it. I like the approach of using database queries plus mode=(all|original|archival), and I would like to at least evaluate that implementation for this PR. I will let you know.

Thank you!

@landreev (Contributor) commented:
From some quick archaeological digging, that "downloadsize" api was indeed added by request from an outside project that wanted to use it in a custom UI app they were building to work with Dataverse; and it loosely matched their use case (they wanted to download everything, original and archival copies of the tabular files). It was obviously never anticipated that we'd be building our own SPA UI that would want to use this api. I don't know if that project is still around, and if that backward compatibility is needed (but I can ask around). But, as I said, I'm still ok with preserving it, just in case.

@pdurbin (Member) commented Sep 28, 2023

added by request from an outside project

Yeah. The use case still makes sense at least:

"I'm an integrating system (Whole Tale, etc) and I want to provide some expectation in my UI about how long something will take when pulling files from Dataverse"

From #6524 (comment)

@GPortas GPortas added the SPA (these changes are required for the Dataverse SPA) label Sep 29, 2023
@landreev (Contributor) commented:
Just want to mention here that, instead of creating custom database queries for calculating these sizes, another alternative would be to try and utilize the EclipseLink left-join "hints", to avoid the follow-up database queries on the datafiles and datatables; so, instead of doing version.getFileMetadatas() etc., executing something along the lines of:

TypedQuery<FileMetadata> query = em.createQuery("SELECT fm FROM FileMetadata fm WHERE ...", FileMetadata.class)
   .setParameter("datasetVersionId", versionId)
   .setHint("eclipselink.left-join-fetch", "fm.dataFile")
   .setHint("eclipselink.left-join-fetch", "fm.dataFile.dataTables");

etc. And that would look up all the datafiles and datatables in one query, without the N additional queries... And then we can iterate through the list as before. (There's a chance that you don't need the .setHint for fm.dataFile there; the second one may be enough to look up both.)
In other words, the same approach as in DatasetVersionServiceBean.findDeep(), but only looking up the bare minimum of what we need for the filesize calculation.

The only way to find out which approach is better would be to test and time it on a dataset with 10Ks of files. A custom query would almost certainly be faster, but the difference may or may not be significant. The left-join approach above may feel a little cleaner, but I'm assuming you pay an extra cost in memory.

@landreev (Contributor) commented:
I may need to take back the "bare minimum" part, though: if we ever use that left-join hint approach, we have to add hints for any other 1:1 relationships there, or else we are back to the EJB executing N additional queries for them...
Regardless, that was not really a practical suggestion for this specific use case; I just wanted to document this alternative for other optimization situations.

@landreev (Contributor) commented Oct 2, 2023

FWIW - I have not analyzed the generated queries too much or benchmarked them, but you can see that JOINs are not used.

[edited:] Interesting, I've never used QueryDSL myself, so I was thinking of writing these queries by hand. It definitely looks cleaner; and it's kind of neat that you can write .where(fileMetadata.dataFile.dataTables.isEmpty()), and it just translates into
AND ((SELECT COUNT(t3.ID) FROM DATATABLE t3 WHERE (t3.DATAFILE_ID = t0.ID)) = 0)

In case you saw the first version of this comment, I was a little confused trying to read the screenshot above; I ended up building the branch and looking at the Postgres log, and playing with QueryDSL some more.

A minor thing, but is the .where(fileMetadata.dataFile.dataTables.isNotEmpty()) really necessary in the getOriginalTabularFilesSize() method? Shouldn't .where(dataTable.dataFile.eq(fileMetadata.dataFile)) be enough?
(I'm ready to approve the PR, looks good!)

uploadTabularFileResponse.then().assertThat().statusCode(OK.getStatusCode());

// Get the original tabular file size
int tabularOriginalSize = Integer.parseInt(uploadTabularFileResponse.getBody().jsonPath().getString("data.files[0].dataFile.filesize"));
@landreev (Contributor) commented:
Minor thing, but why not just = 157? (the known size of src/test/resources/tab/test.tab)
It's a tiny file that should get ingested almost instantly, so there may be a non-zero chance of the size in the response above already being the size of the generated archival file (I'm not sure... but it's safest to use the known size, I think).

@GPortas (Contributor, Author) replied:
I agree. If it makes the test safer, it's worth updating.
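A sketch of what pinning the known size could look like in the integration test (the UtilIT helper signature and the JSON path are assumptions based on the test style in this PR):

// Assert against the known fixture size rather than the upload response
int expectedTabularOriginalSize = 157; // known size of src/test/resources/tab/test.tab

Response downloadSizeResponse = UtilIT.getDownloadSize(datasetId, ":latest", "Original", apiToken);
downloadSizeResponse.then().assertThat()
        .statusCode(OK.getStatusCode())
        .body("data.storageSize", equalTo(expectedTabularOriginalSize));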

@landreev (Contributor) commented Oct 2, 2023

Sorry, my bad - if you have read it already, please ignore my last comment about the query - I missed that AND (COUNT(...) = ...) clause in it. Will edit.

.from(fileMetadata)
.where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
.from(dataTable)
.where(fileMetadata.dataFile.dataTables.isNotEmpty().and(dataTable.dataFile.eq(fileMetadata.dataFile)))
@landreev (Contributor) commented:
Yes, .where(dataTable.dataFile.eq(fileMetadata.dataFile))
should be sufficient - and it simplifies the resulting query a bit.
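For reference, a sketch of the simplified query (assumes QueryDSL Q-classes QFileMetadata.fileMetadata and QDataTable.dataTable, and that DataTable stores the original size in an originalFileSize field):

// Joining dataTable against fileMetadata.dataFile already restricts the
// result set to tabular files, making the isNotEmpty() predicate redundant
JPAQueryFactory queryFactory = new JPAQueryFactory(em);
Long sum = queryFactory
        .from(fileMetadata)
        .where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
        .from(dataTable)
        .where(dataTable.dataFile.eq(fileMetadata.dataFile))
        .select(dataTable.originalFileSize.sum())
        .fetchOne();
long originalTabularFilesSize = (sum == null) ? 0 : sum;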

@GPortas GPortas requested a review from landreev October 3, 2023 08:56
@github-actions bot commented Oct 3, 2023:
📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9958-dataset-files-size
ghcr.io/gdcc/configbaker:9958-dataset-files-size

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev (Contributor) left a review:
Perfect, thank you!

@kcondon kcondon assigned kcondon and unassigned landreev Oct 4, 2023
return error(Response.Status.BAD_REQUEST, "Invalid mode: " + mode);
}
DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
long datasetStorageSize = datasetVersionFilesServiceBean.getFilesDownloadSize(datasetVersion, fileDownloadSizeMode);
@landreev (Contributor) commented:
Hey, I'm wondering if I gave you bad advice to move this size calculation out of the command. The command provided one extra function: the authorization (requiring the ViewUnpublishedDataset permission on draft versions), which we no longer appear to be enforcing in the above (?).
I can't imagine this information, this download size, being truly sensitive... but we apparently used to require that permission, so at the very least we need to confirm whether it is necessary or not.
Sorry for the extra confusion.

@landreev (Contributor) commented:
(I'll ask on slack)

@landreev (Contributor) commented:
Uh, never mind? That authorization check is still being performed, inside the getDatasetVersionOrDie().

@kcondon (Contributor) commented Oct 11, 2023

@GPortas I am seeing the same download size for all variations of the endpoint.

@GPortas (Contributor, Author) commented Oct 11, 2023

@GPortas I am seeing the same download size for all variations of the endpoint.

@kcondon I'm sorry! I forgot to update the PR description with the testing instructions for the latest changes.

@kcondon kcondon merged commit 3d3a553 into 9852-files-api-deaccessioned-datasets Oct 12, 2023
11 of 12 checks passed
@kcondon kcondon deleted the 9958-dataset-files-size branch October 12, 2023 14:05
@pdurbin pdurbin added this to the 6.1 milestone Oct 13, 2023