Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change de-identification method for Sensitive ID export fields #35772

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nospame
Copy link
Contributor

@nospame nospame commented Feb 13, 2025

Product Description

This will result in a change to the appearance and a one-time shift of values for de-identified "Sensitive ID" fields in exports. A previous field that might have looked like "PG42OH9AH5" will now read as a UUID like "3754f1a2-7eb1-4e6b-8640-65d7550c9bbc".

Technical Summary

SAAS-16366
The main concept here is that, instead of using a consistent hashing algorithm for DeID'd fields, we'll create and store a UUID that can be looked up for any given value and domain. This way, the value will still stay consistent over time but cannot be determined by simply re-running the hashing process on a value. This is represented by DeIdMapping (also open to other suggestions on naming).

In order to add a model in the export app that could be used in the same place as other transform functions, I had to tweak how we access those functions in export/const.py. Previously these functions were all imported directly, but (perhaps predictably), adding a reference to a model from a django app within that app's own constants, would have resulted in circular imports. This is dealt with by adding get_transform_function and get_deid_transform_function to export/utils.py, which import the transform function at runtime using importlib. There may be a slight negative performance impact to this choice -- not sure if this is a major concern or how to mitigate, if so.

Safety Assurance

Safety story

This is a change that interacts with core export functionality, so there is some risk here. I've tested locally to see that marked fields are DeID'd correctly and that there is no other change to given exports. There is decent automated test coverage of the export app as a whole and would expect that to catch any major functional issues.

Automated test coverage

Added a small test suite for behavior of the get_deid method of DeIdMapping. The DEID_ID_TRANSFORM that now points to the new method is implicitly tested in a few places throughout hqcase/tests.

QA Plan

Technically this is not a huge change, but given the risk in changes to exports, I think it'll be worthwhile to have this get a pass by QA.

Migrations

  • The migrations in this code can be safely applied first independently of the code

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@dimagimon dimagimon added the reindex/migration Reindex or migration will be required during or before deploy label Feb 13, 2025
@nospame nospame force-pushed the em/deid-export-method branch from b774853 to 5a9f740 Compare February 13, 2025 19:20
Comment on lines +12 to +13
hashed_value = models.TextField(max_length=32)
deid = models.UUIDField(default=uuid.uuid4)
Copy link
Contributor Author

@nospame nospame Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had trouble coming up with good names for these fields, with fairly little context on export code in general. Open to alternatives here. Maybe something closer to hashed_item and deid_value?

Comment on lines +38 to +40
if value is None:
# None is a de-identifiable value but needs a string to encode for lookup
value = ''
Copy link
Contributor Author

@nospame nospame Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a meaningful difference between an empty string and None in data this would be de-identifying.

Previously you would get the same output for the both None and the string 'None' and this seems better to me, but if there's any reason to retain the previous behavior I can just call str(value) instead.

@nospame nospame requested review from esoergel and gherceg February 13, 2025 22:30
@nospame nospame marked this pull request as ready for review February 13, 2025 23:10
@nospame nospame force-pushed the em/deid-export-method branch from 78eab85 to 29a676a Compare February 14, 2025 17:29
@nospame nospame changed the title Change de-identification method for Sensitive ID fields Change de-identification method for Sensitive ID export fields Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reindex/migration Reindex or migration will be required during or before deploy
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants