Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Adjust transformation deletes to handle cascading deletes #103

Merged
merged 48 commits into from
Jul 8, 2024

Conversation

chouinar
Copy link
Collaborator

Summary

Note this is a duplicate of HHS#2000 - just want to pull it into this repo first

Time to review: 5 mins

Changes proposed

Updates the transformation code to handle a case where a parent record (ie. opportunity or opportunity_summary) is deleted, AND the child records (everything else) is marked to be deleted as well.

Also added a new way to set metrics that handles adding more specific prefixed ones (eg. total_records_processed and opportunity.total_records_processed) - will expand more on this later.

Context for reviewers

Imagine a scenario an opportunity with a summary (synopsis) and a few applicant types gets deleted. The update process for loading from Oracle will mark all of our staging table records for those as is_deleted=True. When we go to process, we'll first process the opportunity, and delete it uneventfully, however we have cascade-deletes setup. This means that all of the children (the opportunity summary, and assistance listing tables among many others) also need to be deleted. SQLAlchemy handles this for us.

However, this means when we then start processing the synopsis record that was marked as deleted - we would error and say "I can't delete something that doesn't exist". To work around this, we're okay with these orphan deletes, and we just assume we already took care of it.

Additional information

To further test this, I loaded a subset of the prod data locally (~2500 opportunities, 35k records total). I then marked all of the data is is_deleted=True, transformed_at=null and ran it again. It went through the opportunities deleting them. When it got to the other tables, it didn't have to do very much as they all hit the new case. The metrics produced look like:

total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0

opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453

assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814

opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827

applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547

funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947

funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414

And as a sanity check, running again processes nothing.

Copy link
Collaborator

@coilysiren coilysiren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked questions about things I don't understand! Can you @ me in the replies so I get the notifications 🙏🏼

Comment on lines +103 to +104
source: S,
target: D | None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what's going on with this single letter vars 😵 I even looked into TypeVars briefly, and unfortunately, I still don't understand. But I did see that single letter vars are a common practice when defining TypeVars, so I don't think there's anything to change here.

Copy link
Collaborator Author

@chouinar chouinar Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in this particular function, they're really just serving as as type alias - rather than doing something like target: Opportunity | OpportunitySummary | etc

TypeVars are mostly used to handle implementing generics. For example, if you implemented your own dictionary class, you'd likely define TypeVars of K and V.

The main value of using these is so the type checker can follow your types, in the case of a dictionary, the "get" method would return V which the type checker would be able to infer based on how you setup the instance of the class object.


Not quite in this PR, but I have a fetch method with typing like:

def fetch(
        self, source_model: Type[S], destination_model: Type[D], join_clause: Sequence
    ) -> list[Tuple[S, D | None]]:

Effectively saying that for whatever type you pass in for the source and destination models, a list of tuples will be returned with objects of those types. In this case, if you called with:

value = fetch(TOpportunity, Opportunity, [])
# a valid value could be [ (Topportunity(...), Opportunity(...)), (Topportunity(...), None)]

# it, we'd hit this case, which isn't a problem.
logger.info("Cannot delete %s record as it does not exist", record_type, extra=extra)
source.transformation_notes = ORPHANED_DELETE_RECORD
self.increment(self.Metrics.TOTAL_DELETE_ORPHANS_SKIPPED, prefix=record_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate these metrics 👍🏼

Comment on lines +299 to +302
if source_assistance_listing.is_deleted:
self._handle_delete(
source_assistance_listing, target_assistance_listing, ASSISTANCE_LISTING, extra
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand these lines. It looks like it's saying

# if source listing is already deleted
# delete it again

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind, I just needed to read the PR description ^^

I feel like is_deleted means something more like to_be_deleted or marked_for_deletion ? idk. That's out of scope for this PR though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the column is is_deleted which makes sense in the context of the table (it is or isn't deleted), but in the context of processing those records does read a bit weird. I could alias the field name using a python property, but that might be more confusing as you'd need to trace where it comes from.

extra = transform_util.get_log_extra_funding_instrument(source_funding_instrument)
logger.info("Processing funding instrument", extra=extra)

if source_funding_instrument.is_deleted:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you're moving this case higher in the if elif statements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _handle_delete function handles the case where something simultaneously hits both the is_deleted and orphaned record case at once.

In reality the if/elif-statements are:

  • is deleted + orphaned (inside _handle_delete)
  • is deleted (inside _handle_delete)
  • orphaned + historical
  • insert

Comment on lines +66 to +68
if prefix is not None:
# Rather than re-implement the above, just re-use the function without a prefix
self.increment(f"{prefix}.{name}", value, prefix=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure I understand this code correctly. When you add a metric with a prefix, you are actually creating two metrics. One with the prefix, and one without, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that way we can have high-level and granular metrics all in one.

So if we were working with a metric like records_processed, and were working with record types a, b and c we could end up with something like:

records_processed -> 10
a_records_processed -> 5
b_records_processed -> 3
c_records_processed -> 2

Without needing several separate calls to the function.

@chouinar chouinar merged commit 9bd7bbf into main Jul 8, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/1983-cascade-deletes branch July 8, 2024 15:27
acouch pushed a commit that referenced this pull request Sep 18, 2024
Note this is a duplicate of
HHS#2000 - just want to pull
it into this repo first

Updates the transformation code to handle a case where a parent record
(ie. opportunity or opportunity_summary) is deleted, AND the child
records (everything else) is marked to be deleted as well.

Also added a new way to set metrics that handles adding more specific
prefixed ones (eg. `total_records_processed` and
`opportunity.total_records_processed`) - will expand more on this later.

Imagine a scenario an opportunity with a summary (synopsis) and a few
applicant types gets deleted. The update process for loading from Oracle
will mark all of our staging table records for those as
`is_deleted=True`. When we go to process, we'll first process the
opportunity, and delete it uneventfully, however we have cascade-deletes
setup. This means that all of the children (the opportunity summary, and
assistance listing tables among many others) also need to be deleted.
SQLAlchemy handles this for us.

However, this means when we then start processing the synopsis record
that was marked as deleted - we would error and say "I can't delete
something that doesn't exist". To work around this, we're okay with
these orphan deletes, and we just assume we already took care of it.

To further test this, I loaded a subset of the prod data locally (~2500
opportunities, 35k records total). I then marked all of the data is
`is_deleted=True, transformed_at=null` and ran it again. It went through
the opportunities deleting them. When it got to the other tables, it
didn't have to do very much as they all hit the new case. The metrics
produced look like:

```
total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0

opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453

assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814

opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827

applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547

funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947

funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414
```
And as a sanity check, running again processes nothing.

---------

Co-authored-by: nava-platform-bot <[email protected]>
acouch pushed a commit that referenced this pull request Sep 18, 2024
Note this is a duplicate of
HHS#2000 - just want to pull
it into this repo first

Updates the transformation code to handle a case where a parent record
(ie. opportunity or opportunity_summary) is deleted, AND the child
records (everything else) is marked to be deleted as well.

Also added a new way to set metrics that handles adding more specific
prefixed ones (eg. `total_records_processed` and
`opportunity.total_records_processed`) - will expand more on this later.

Imagine a scenario an opportunity with a summary (synopsis) and a few
applicant types gets deleted. The update process for loading from Oracle
will mark all of our staging table records for those as
`is_deleted=True`. When we go to process, we'll first process the
opportunity, and delete it uneventfully, however we have cascade-deletes
setup. This means that all of the children (the opportunity summary, and
assistance listing tables among many others) also need to be deleted.
SQLAlchemy handles this for us.

However, this means when we then start processing the synopsis record
that was marked as deleted - we would error and say "I can't delete
something that doesn't exist". To work around this, we're okay with
these orphan deletes, and we just assume we already took care of it.

To further test this, I loaded a subset of the prod data locally (~2500
opportunities, 35k records total). I then marked all of the data is
`is_deleted=True, transformed_at=null` and ran it again. It went through
the opportunities deleting them. When it got to the other tables, it
didn't have to do very much as they all hit the new case. The metrics
produced look like:

```
total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0

opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453

assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814

opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827

applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547

funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947

funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414
```
And as a sanity check, running again processes nothing.

---------

Co-authored-by: nava-platform-bot <[email protected]>
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Note this is a duplicate of
#2000 - just want to pull
it into this repo first

Updates the transformation code to handle a case where a parent record
(ie. opportunity or opportunity_summary) is deleted, AND the child
records (everything else) is marked to be deleted as well.

Also added a new way to set metrics that handles adding more specific
prefixed ones (eg. `total_records_processed` and
`opportunity.total_records_processed`) - will expand more on this later.

Imagine a scenario an opportunity with a summary (synopsis) and a few
applicant types gets deleted. The update process for loading from Oracle
will mark all of our staging table records for those as
`is_deleted=True`. When we go to process, we'll first process the
opportunity, and delete it uneventfully, however we have cascade-deletes
setup. This means that all of the children (the opportunity summary, and
assistance listing tables among many others) also need to be deleted.
SQLAlchemy handles this for us.

However, this means when we then start processing the synopsis record
that was marked as deleted - we would error and say "I can't delete
something that doesn't exist". To work around this, we're okay with
these orphan deletes, and we just assume we already took care of it.

To further test this, I loaded a subset of the prod data locally (~2500
opportunities, 35k records total). I then marked all of the data is
`is_deleted=True, transformed_at=null` and ran it again. It went through
the opportunities deleting them. When it got to the other tables, it
didn't have to do very much as they all hit the new case. The metrics
produced look like:

```
total_records_processed=37002
total_records_deleted=2453
total_delete_orphans_skipped=34549
total_error_count=0

opportunity.total_records_processed=2453
opportunity.total_records_deleted=2453

assistance_listing.total_records_processed=3814
assistance_listing.total_delete_orphans_skipped=3814

opportunity_summary.total_records_processed=3827
opportunity_summary.total_delete_orphans_skipped=3827

applicant_type.total_records_processed=17547
applicant_type.total_delete_orphans_skipped=17547

funding_category.total_records_processed=4947
funding_category.total_delete_orphans_skipped=4947

funding_instrument.total_records_processed=4414
funding_instrument.total_delete_orphans_skipped=4414
```
And as a sanity check, running again processes nothing.

---------

Co-authored-by: nava-platform-bot <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants