Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database schema optimizations #4731

Open
wants to merge 127 commits into
base: main
Choose a base branch
from

Conversation

SergeyGaluzo
Copy link
Contributor

@SergeyGaluzo SergeyGaluzo commented Nov 19, 2024

Optimizations:

  1. History/current resources split with removal of not needed indexes.
  2. Raw resources in separate table.
  3. Resource id integer mapping.
  4. Enabled writes/reads of raw resources to/from ADLS.
  5. Moved string references to StringReferenceSearchParams table and extended id length to 768 B.

Deployment is dependent on small schema change creating CurrentResource view. #4755

https://microsofthealth.visualstudio.com/Health/_workitems/edit/135369

Question:
Can this PR be cut into 2 smaller ones?
Answer:
No, if we want to avoid re-writing Resource table twice. Here is why. There are 2 main schema optimizations: #1 moving historical resources into separate table, which allows to reduce number of indexes, and, hence, save space, and #2 resource id integer mapping. Though these optimizations are independent, both require changes in the way resources are stored. #1 requires adding not only history table but also extra table to store raw resources (this reduces update latency as old raw resource stays in the same place). #2 requires new ResourceIdIntMap table, and also changes main resource row, where ResourceId string is replaced by its integer mapped value ResourceIdInt.

Question:
This PR has ~13K lines changed. Why is it that big?
Answer:
FHIR solution is not using database projects:
#1 Currently even a single line change in a single stored procedure results in a full schema SQL script added to PR. This alone is 6.5K code lines.
#2 Modified, even slightly, stored procedures have to be added in full to the diff.sql (~2K lines).
Other reasons:
#1 Data migration requires intermediate stored procedures and views to access old and new schema. This almost doubles the total changed lines in diff.sql.
#2 FHIR uses generator for database wrapper classes. This generator does not work with views, so, to bypass this limitation, SQL scripts have "artificial" table statements to generate wrappers, which (tables) are immediately dropped, and "replaced" by view statements (same names, so table wrapper classes can be reused for views). This significantly contributes to changed lines.
#3 There are ~300 lines of code specific to read/write of raw resources from/to ADLS. This code is dormant until ADLS is enabled.
#4 There are ~936 lines of test code changes.

SergeyGaluzo and others added 30 commits September 27, 2024 08:42
* History separation V0

* Forgotten resource

* replace current by view

* TRUNCATE

* HardDelete

* delete history v0

* missed 64 and removed dead code from SQL query generator

* Removed history from current in merge

* Triggers

* tests and tran

* disable/enable indexes

* rollback hard delete and merge

* rollback delete invisible history

* TRUNCATE -> DELETE

* line

* leftovers

* PK check

* Cosmetic

* WHERE nanes

* Get resources with forced indexes

* Removed redundant where

* Adding feature flag for raw resource dedupping

* Enable invisible history by default

* parameters

* commit

* fixed typo

* 65 + more realistic diff

* adjusted update trigger

* right trigger

* next iteration

* special chars

* Added delete and update

* \r

* Adding script runner

* Added verification

* cosmetic

* HOLDLOCK hint

* removed incorrect comments

* RT

* lock timeout

* 180 and remove empty

* no changes in update resource search params

* blob rewriter tool

* Revert "no changes in update resource search params"

This reverts commit b2e0c38.

* comment

* Generic disable indexes and update search params

* Dummy resources

* fixes

* Dummy records based on surr id

* adjust test to filter dummy rows

* exclude history and current

* Adding PerfTest V-1

* Get asyn wrapper without type string to id function

* tool

* not exists on data copy

* packages back

* Added comments.

* Correct filtered index on resource current

* Added redundant IsHistory=0 to index

* deduping

* Removed dummy records

* pp-p

* history clause

* deduping

* testing parameters

* get resource by type and surr id range with many versions

* skip large databases

* Added calls to fhir

* added conflict

* max retries = 3

* database pings

* examples

* reverse

* start closer

* put logic

* Create

* Move 65 to 84

* Fixes after merge

* Adding RawResources table

* change capture tweeks

* Test fix

* Added MI

* inline insert

* Repeat on updates
@microsoft microsoft deleted a comment from azure-pipelines bot Jan 13, 2025
@gunjitchhhatwal
Copy link

gunjitchhhatwal commented Jan 14, 2025

5. Extended string reference id length.

@SergeyGaluzo - could you provide a description of this change?

@gunjitchhhatwal I updated #5 description.

@microsoft microsoft deleted a comment from gunjitchhhatwal Jan 19, 2025
@microsoft microsoft deleted a comment from azure-pipelines bot Jan 19, 2025
@microsoft microsoft deleted a comment from azure-pipelines bot Jan 19, 2025
@SergeyGaluzo
Copy link
Contributor Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs Enhancement Enhancement on existing functionality. Schema Version backward compatible
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants