Adjust transformation deletes to handle cascading deletes #103
Conversation
Asked questions about things I don't understand! Can you @ me in the replies so I get the notifications 🙏🏼
source: S,
target: D | None,
I have no idea what's going on with these single-letter vars 😵 I even looked into TypeVars briefly, and unfortunately, I still don't understand. But I did see that single-letter vars are a common practice when defining TypeVars, so I don't think there's anything to change here.
So, in this particular function, they're really just serving as a type alias - rather than doing something like `target: Opportunity | OpportunitySummary | ...`.
TypeVars are mostly used for implementing generics. For example, if you implemented your own dictionary class, you'd likely define TypeVars of `K` and `V`.
The main value of using these is that the type checker can follow your types. In the case of a dictionary, the "get" method would return `V`, which the type checker would be able to infer based on how you set up the instance of the class.
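As a toy, self-contained example of that dictionary case (not code from this repo), so you can see how the TypeVars flow through:
```
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class MyDict(Generic[K, V]):
    """Toy dictionary, only to show how TypeVars flow through a generic class."""

    def __init__(self) -> None:
        self._data: dict[K, V] = {}

    def set(self, key: K, value: V) -> None:
        self._data[key] = value

    def get(self, key: K) -> V | None:
        # The type checker infers the concrete V from how the instance was declared
        return self._data.get(key)


d: MyDict[str, int] = MyDict()
d.set("count", 3)
value = d.get("count")  # the type checker knows this is int | None
```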
Not quite in this PR, but I have a fetch method with typing like:
def fetch(
self, source_model: Type[S], destination_model: Type[D], join_clause: Sequence
) -> list[Tuple[S, D | None]]:
Effectively saying that for whatever type you pass in for the source and destination models, a list of tuples will be returned with objects of those types. In this case, if you called with:
value = fetch(TOpportunity, Opportunity, [])
# a valid value could be [ (TOpportunity(...), Opportunity(...)), (TOpportunity(...), None) ]
# it, we'd hit this case, which isn't a problem.
logger.info("Cannot delete %s record as it does not exist", record_type, extra=extra)
source.transformation_notes = ORPHANED_DELETE_RECORD
self.increment(self.Metrics.TOTAL_DELETE_ORPHANS_SKIPPED, prefix=record_type)
Appreciate these metrics 👍🏼
if source_assistance_listing.is_deleted:
    self._handle_delete(
        source_assistance_listing, target_assistance_listing, ASSISTANCE_LISTING, extra
    )
I don't understand these lines. It looks like it's saying
# if source listing is already deleted
# delete it again
?
Ah nevermind, I just needed to read the PR description ^^
I feel like `is_deleted` means something more like `to_be_deleted` or `marked_for_deletion`? idk. That's out of scope for this PR though.
Yeah, the column is `is_deleted`, which makes sense in the context of the table (it is or isn't deleted), but in the context of processing those records it does read a bit weird. I could alias the field name using a Python property, but that might be more confusing as you'd need to trace where it comes from.
extra = transform_util.get_log_extra_funding_instrument(source_funding_instrument)
logger.info("Processing funding instrument", extra=extra)

if source_funding_instrument.is_deleted:
Is there a specific reason you're moving this case higher in the if/elif statements?
The `_handle_delete` function handles the case where something hits both the `is_deleted` and the orphaned-record case at once.
In reality the if/elif statements are:
- is deleted + orphaned (inside `_handle_delete`)
- is deleted (inside `_handle_delete`)
- orphaned + historical
- insert
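Roughly, as a standalone toy sketch of that ordering (made-up names and outcomes, not the exact code in this PR):
```
from dataclasses import dataclass


@dataclass
class SourceRecord:
    is_deleted: bool
    is_historical: bool = False


# Illustrative only - attribute and outcome names are invented to show the
# branch ordering, not the actual transform code.
def process(source: SourceRecord, target: object | None) -> str:
    if source.is_deleted:
        # _handle_delete covers both sub-cases: target missing (orphaned
        # delete, skipped with a metric) and target present (real delete)
        return "orphan delete skipped" if target is None else "deleted"
    if target is None and source.is_historical:
        return "historical orphan skipped"
    return "inserted"


print(process(SourceRecord(is_deleted=True), None))   # orphan delete skipped
print(process(SourceRecord(is_deleted=False), None))  # inserted
```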
if prefix is not None:
    # Rather than re-implement the above, just re-use the function without a prefix
    self.increment(f"{prefix}.{name}", value, prefix=None)
Making sure I understand this code correctly. When you add a metric with a prefix, you are actually creating two metrics. One with the prefix, and one without, right?
Yes, that way we can have high-level and granular metrics all in one.
So if we were working with a metric like `records_processed`, and were working with record types `a`, `b`, and `c`, we could end up with something like:
records_processed -> 10
a_records_processed -> 5
b_records_processed -> 3
c_records_processed -> 2
Without needing several separate calls to the function.
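As a toy, self-contained version of that pattern (not the actual metrics class in this repo):
```
from collections import defaultdict


class Metrics:
    """Toy version of the increment-with-prefix pattern described above."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = defaultdict(int)

    def increment(self, name: str, value: int = 1, prefix: str | None = None) -> None:
        self.counts[name] += value
        if prefix is not None:
            # Re-use the same logic to also record the prefixed metric
            self.increment(f"{prefix}.{name}", value, prefix=None)


m = Metrics()
m.increment("records_processed", 5, prefix="a")
m.increment("records_processed", 3, prefix="b")
m.increment("records_processed", 2, prefix="c")
# counts now: records_processed=10, a.records_processed=5,
#             b.records_processed=3, c.records_processed=2
```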
Summary
Note this is a duplicate of HHS#2000 - just want to pull it into this repo first
Time to review: 5 mins
Changes proposed
Updates the transformation code to handle a case where a parent record (i.e. opportunity or opportunity_summary) is deleted, AND the child records (everything else) are marked to be deleted as well.
Also added a new way to set metrics that handles adding more specific prefixed ones (e.g. `total_records_processed` and `opportunity.total_records_processed`) - will expand more on this later.
Context for reviewers
Imagine a scenario where an opportunity with a summary (synopsis) and a few applicant types gets deleted. The update process for loading from Oracle will mark all of our staging table records for those as `is_deleted=True`. When we go to process, we'll first process the opportunity and delete it uneventfully; however, we have cascade-deletes set up. This means that all of the children (the opportunity summary and assistance listing tables, among many others) also need to be deleted. SQLAlchemy handles this for us.
However, this means that when we then start processing the synopsis record that was marked as deleted, we would error and say "I can't delete something that doesn't exist". To work around this, we're okay with these orphan deletes, and we just assume we already took care of them.
Additional information
To further test this, I loaded a subset of the prod data locally (~2500 opportunities, 35k records total). I then marked all of the data as `is_deleted=True, transformed_at=null` and ran it again. It went through the opportunities, deleting them. When it got to the other tables, it didn't have to do very much as they all hit the new case. The metrics produced look like:
```
total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0
opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453
assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814
opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827
applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547
funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947
funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414
```
And as a sanity check, running again processes nothing.
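For reviewers less familiar with the cascade behavior referenced above, here is a minimal SQLAlchemy sketch (hypothetical Parent/Child models, not this repo's actual tables) of what "SQLAlchemy handles this for us" means - and why the child staging records later look like orphaned deletes:
```
# Hypothetical models, only to illustrate cascade deletes; these are not the
# actual opportunity / opportunity_summary models from this repo.
from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Parent(Base):
    __tablename__ = "parent"
    id: Mapped[int] = mapped_column(primary_key=True)
    # Deleting a Parent also deletes its Child rows via the ORM cascade
    children: Mapped[list["Child"]] = relationship(cascade="all, delete-orphan")


class Child(Base):
    __tablename__ = "child"
    id: Mapped[int] = mapped_column(primary_key=True)
    parent_id: Mapped[int] = mapped_column(ForeignKey("parent.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    parent = Parent(children=[Child(), Child()])
    session.add(parent)
    session.commit()

    # Deleting the parent cascades to the children, so a later attempt to
    # delete a child directly finds nothing - the "orphaned delete" case.
    session.delete(parent)
    session.commit()
    assert session.query(Child).count() == 0
```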