Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Versions API improvements (9763) #9883

Merged
merged 31 commits into from
Oct 18, 2023
Merged

Versions API improvements (9763) #9883

merged 31 commits into from
Oct 18, 2023

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Sep 6, 2023

What this PR does / why we need it:

(copy-and-pasting from the comments in the linked issue):

I'm addressing the performance issues in the versions api via a combination of the following approaches:

  • (optional) pagination is being added to /api/datasets/{id}/versions
  • a new flag includeFiles is added to both /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid} (true by default), providing an option to drop the file information from the output
  • when files are requested to be included, the filemetadatas are looked up using the "left joint fetch" optimizations that retrieve the information from the extended table tree in a single query, instead of allowing EJB to perform 1 SELECT query on each 1:1 relation for each FileMetadata entry. On a version with N filemetadatas this saves 3*N individual queries. This is not free resources-wise by any means (costs more memory primarily), but still appears to show measurable improvement on datasets with large numbers of versions and/or files.

The following real life datasets from IQSS prod. service are used in the sample tests below:

  • 5255036: 237 versions (!), ~2,400 files. Files are relatively sparsely spread between versions.
  • 4554342: 2 versions, 1 pub. 1 draft. ~25,000 files (all appear in both versions).
  • 2710927: a control "reasonable" dataset, 100 files, 3 published versions.

/api/datasets/{id}/versions:

dataset id # of versions # of files response size v5.14 response time new response time new size w/out files new time w/out files
5255036 237 2400 38MB 1m30s 48s 2MB 12s
4554342 2 25000 17MB 2m16s 12s 2.5K 1s
2710927 3 100 .2MB 1s <1s 26K <1s

/api/datasets/{id}/versions/{vid}:

dataset id version # of files in version response size v5.14 response time new response time new size w/out files new time w/out files
5255036 87.1 12 24K 1s <1s 12K <1s
4554342 draft 25000 17MB 2m17s 12s 4K <1s
2710927 3.0 100 78K 1s <1s 9K <1s

Note: /api/datasets/4554342/versions and /api/datasets/4554342/versions/draft produce the same amount of output because the former api was called without authentication, so only one, published version was included.

Note: tests above were run on the dedicated IQSS test system (not in prod., which is a beefier and faster system).

Which issue(s) this PR closes:

Closes #9763

Special notes for your reviewer:

Please take an extra look at the framework for caching the adjusted "embargo publication date" that I added, replacing the real time lookup that used to be in Dataset.getCitationDatet(), and especially the flyway script added in the PR.

Suggestions on how to test this:

The description above outlines the way I tested the branch during dev.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick comments.

doc/sphinx-guides/source/api/native-api.rst Outdated Show resolved Hide resolved
return embargoCitationDate;
}
}
// @todo: remove this commented-out code once/if the PR passes review - L.A.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good to remove commented out code.

Newest,
Oldest,
Size,
Type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize this from #9693. Does it matter which PR is merged first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer #9693 to be merged first, and then resolve the conflict in this branch (I added a database lookup optimization on top of the code introduced in that PR).

@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

2 similar comments
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@pdurbin pdurbin self-assigned this Sep 7, 2023
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed at standup, @landreev sounds happy to add some tests so I'm returning this to him. Otherwise, I'd say it's ready for QA.

@pdurbin pdurbin assigned landreev and unassigned pdurbin Sep 11, 2023
@github-actions

This comment has been minimized.

avoid conflict with V6.0.0.1__9599-guestbook-at-request.sql
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run the code but it looks good. Conflicts resolved. Tests added. Docs look good.

I did notice a sql script naming conflict so I bumped the version. And I merged the latest from develop.

I'm only holding off on moving this to QA because Jenkins tests are failing (the "fire off installer" step). I started a thread in Slack about it: https://iqss.slack.com/archives/C010LA04BCG/p1697206082190619

@github-actions

This comment has been minimized.

@pdurbin
Copy link
Member

pdurbin commented Oct 13, 2023

I just kicked this off to get a second opinion from our container-based API test runner: https://github.com/gdcc/api-test-runner/actions/runs/6509472756

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jenkins is still failing but GitHub Actions passed: https://github.com/gdcc/api-test-runner/actions/runs/6509472756

Approved.

@pdurbin
Copy link
Member

pdurbin commented Oct 13, 2023

I ran Jenkins again. It's now saying [ERROR] DatasetsIT.getVersionFiles:3485 1 expectation failed.", "JSON path data[1].label doesn't match.", "Expected: test_1.txt", " Actual: test_3.txt"

@landreev should we be worried about this? From https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9883/17/consoleFull

@github-actions

This comment has been minimized.

1 similar comment
@github-actions
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9763-lookup-optimizations
ghcr.io/gdcc/configbaker:9763-lookup-optimizations

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're having some testing challenges within Jenkins at the moment...

... but I'm pretty sure they are unrelated to this PR. (Tests passed in GitHub Actions.) Moving to ready for QA.

@kcondon kcondon self-assigned this Oct 18, 2023
@kcondon kcondon merged commit d8d980b into develop Oct 18, 2023
16 checks passed
@kcondon kcondon deleted the 9763-lookup-optimizations branch October 18, 2023 14:30
@pdurbin pdurbin added this to the 6.1 milestone Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance: Slow response for the versions API call with large number of files or versions
6 participants