Adjust transformation deletes to handle cascading deletes #103
Conversation
Asked questions about things I don't understand! Can you @ me in the replies so I get the notifications 🙏🏼
source: S,
target: D | None,
I have no idea what's going on with these single-letter vars 😵 I even looked into TypeVars briefly, and unfortunately, I still don't understand. But I did see that single-letter vars are a common practice when defining TypeVars, so I don't think there's anything to change here.
So, in this particular function, they're really just serving as a type alias - rather than doing something like `target: Opportunity | OpportunitySummary | ...`.
TypeVars are mostly used for implementing generics. For example, if you implemented your own dictionary class, you'd likely define TypeVars of `K` and `V`.
The main value of using these is that the type checker can follow your types. In the case of a dictionary, the "get" method would return `V`, which the type checker would be able to infer based on how you set up the instance of the class.
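As a toy, self-contained example of that dictionary case (not code from this repo), so you can see how the TypeVars flow through:
```
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class MyDict(Generic[K, V]):
    """Toy dictionary, only to show how TypeVars flow through a generic class."""

    def __init__(self) -> None:
        self._data: dict[K, V] = {}

    def set(self, key: K, value: V) -> None:
        self._data[key] = value

    def get(self, key: K) -> V | None:
        # The type checker infers the concrete V from how the instance was declared
        return self._data.get(key)


d: MyDict[str, int] = MyDict()
d.set("count", 3)
value = d.get("count")  # the type checker knows this is int | None
```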
Not quite in this PR, but I have a fetch method with typing like:
def fetch(
self, source_model: Type[S], destination_model: Type[D], join_clause: Sequence
) -> list[Tuple[S, D | None]]:
Effectively saying that for whatever type you pass in for the source and destination models, a list of tuples will be returned with objects of those types. In this case, if you called with:
value = fetch(TOpportunity, Opportunity, [])
# a valid value could be [ (TOpportunity(...), Opportunity(...)), (TOpportunity(...), None) ]
# it, we'd hit this case, which isn't a problem.
logger.info("Cannot delete %s record as it does not exist", record_type, extra=extra)
source.transformation_notes = ORPHANED_DELETE_RECORD
self.increment(self.Metrics.TOTAL_DELETE_ORPHANS_SKIPPED, prefix=record_type)
Appreciate these metrics 👍🏼
if source_assistance_listing.is_deleted:
    self._handle_delete(
        source_assistance_listing, target_assistance_listing, ASSISTANCE_LISTING, extra
    )
I don't understand these lines. It looks like it's saying
# if source listing is already deleted
# delete it again
?
Ah nevermind, I just needed to read the PR description ^^
I feel like `is_deleted` means something more like `to_be_deleted` or `marked_for_deletion`? idk. That's out of scope for this PR though.
Yeah, the column is `is_deleted`, which makes sense in the context of the table (it is or isn't deleted), but in the context of processing those records it does read a bit weird. I could alias the field name using a Python property, but that might be more confusing as you'd need to trace where it comes from.
extra = transform_util.get_log_extra_funding_instrument(source_funding_instrument)
logger.info("Processing funding instrument", extra=extra)

if source_funding_instrument.is_deleted:
Is there a specific reason you're moving this case higher in the if/elif statements?
The `_handle_delete` function handles the case where something hits both the `is_deleted` and the orphaned-record case at once.
In reality the if/elif statements are:
- is deleted + orphaned (inside `_handle_delete`)
- is deleted (inside `_handle_delete`)
- orphaned + historical
- insert
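Roughly, as a standalone toy sketch of that ordering (made-up names and outcomes, not the exact code in this PR):
```
from dataclasses import dataclass


@dataclass
class SourceRecord:
    is_deleted: bool
    is_historical: bool = False


# Illustrative only - attribute and outcome names are invented to show the
# branch ordering, not the actual transform code.
def process(source: SourceRecord, target: object | None) -> str:
    if source.is_deleted:
        # _handle_delete covers both sub-cases: target missing (orphaned
        # delete, skipped with a metric) and target present (real delete)
        return "orphan delete skipped" if target is None else "deleted"
    if target is None and source.is_historical:
        return "historical orphan skipped"
    return "inserted"


print(process(SourceRecord(is_deleted=True), None))   # orphan delete skipped
print(process(SourceRecord(is_deleted=False), None))  # inserted
```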
if prefix is not None:
    # Rather than re-implement the above, just re-use the function without a prefix
    self.increment(f"{prefix}.{name}", value, prefix=None)
Making sure I understand this code correctly. When you add a metric with a prefix, you are actually creating two metrics. One with the prefix, and one without, right?
Yes, that way we can have high-level and granular metrics all in one.
So if we were working with a metric like `records_processed`, and were working with record types `a`, `b`, and `c`, we could end up with something like:
records_processed -> 10
a_records_processed -> 5
b_records_processed -> 3
c_records_processed -> 2
Without needing several separate calls to the function.
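As a toy, self-contained version of that pattern (not the actual metrics class in this repo):
```
from collections import defaultdict


class Metrics:
    """Toy version of the increment-with-prefix pattern described above."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = defaultdict(int)

    def increment(self, name: str, value: int = 1, prefix: str | None = None) -> None:
        self.counts[name] += value
        if prefix is not None:
            # Re-use the same logic to also record the prefixed metric
            self.increment(f"{prefix}.{name}", value, prefix=None)


m = Metrics()
m.increment("records_processed", 5, prefix="a")
m.increment("records_processed", 3, prefix="b")
m.increment("records_processed", 2, prefix="c")
# counts now: records_processed=10, a.records_processed=5,
#             b.records_processed=3, c.records_processed=2
```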
Summary
Note this is a duplicate of HHS#2000 - just want to pull it into this repo first
Time to review: 5 mins
Changes proposed
Updates the transformation code to handle a case where a parent record (i.e. opportunity or opportunity_summary) is deleted, AND the child records (everything else) are marked to be deleted as well.
Also added a new way to set metrics that handles adding more specific prefixed ones (e.g. `total_records_processed` and `opportunity.total_records_processed`) - will expand more on this later.
Context for reviewers
Imagine a scenario where an opportunity with a summary (synopsis) and a few applicant types gets deleted. The update process for loading from Oracle will mark all of our staging table records for those as `is_deleted=True`. When we go to process, we'll first process the opportunity and delete it uneventfully; however, we have cascade-deletes set up. This means that all of the children (the opportunity summary and assistance listing tables, among many others) also need to be deleted. SQLAlchemy handles this for us.
However, this means that when we then start processing the synopsis record that was marked as deleted, we would error and say "I can't delete something that doesn't exist". To work around this, we're okay with these orphan deletes, and we just assume we already took care of them.
Additional information
To further test this, I loaded a subset of the prod data locally (~2500 opportunities, 35k records total). I then marked all of the data as `is_deleted=True, transformed_at=null` and ran it again. It went through the opportunities, deleting them. When it got to the other tables, it didn't have to do very much as they all hit the new case. The metrics produced look like:
```
total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0
opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453
assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814
opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827
applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547
funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947
funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414
```
And as a sanity check, running again processes nothing.
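For reviewers less familiar with the cascade behavior referenced above, here is a minimal SQLAlchemy sketch (hypothetical Parent/Child models, not this repo's actual tables) of what "SQLAlchemy handles this for us" means - and why the child staging records later look like orphaned deletes:
```
# Hypothetical models, only to illustrate cascade deletes; these are not the
# actual opportunity / opportunity_summary models from this repo.
from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Parent(Base):
    __tablename__ = "parent"
    id: Mapped[int] = mapped_column(primary_key=True)
    # Deleting a Parent also deletes its Child rows via the ORM cascade
    children: Mapped[list["Child"]] = relationship(cascade="all, delete-orphan")


class Child(Base):
    __tablename__ = "child"
    id: Mapped[int] = mapped_column(primary_key=True)
    parent_id: Mapped[int] = mapped_column(ForeignKey("parent.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    parent = Parent(children=[Child(), Child()])
    session.add(parent)
    session.commit()

    # Deleting the parent cascades to the children, so a later attempt to
    # delete a child directly finds nothing - the "orphaned delete" case.
    session.delete(parent)
    session.commit()
    assert session.query(Child).count() == 0
```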