-
-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change de-identification method for Sensitive ID export fields #35772
base: master
Are you sure you want to change the base?
Changes from all commits
00771bb
fa24fd6
2f2b162
5a9f740
5e72079
0d322db
29a676a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# Generated by Django 4.2.17 on 2025-02-12 00:46 | ||
|
||
from django.db import migrations, models | ||
import uuid | ||
|
||
|
||
class Migration(migrations.Migration): | ||
|
||
dependencies = [ | ||
("export", "0013_rm_incrementalexport"), | ||
] | ||
|
||
operations = [ | ||
migrations.CreateModel( | ||
name="DeIdMapping", | ||
fields=[ | ||
( | ||
"id", | ||
models.AutoField( | ||
auto_created=True, | ||
primary_key=True, | ||
serialize=False, | ||
verbose_name="ID", | ||
), | ||
), | ||
("domain", models.TextField(max_length=255)), | ||
("hashed_value", models.TextField(max_length=32)), | ||
("deid", models.UUIDField(default=uuid.uuid4)), | ||
], | ||
options={ | ||
"indexes": [ | ||
models.Index( | ||
fields=["domain", "hashed_value"], | ||
name="export_deid_domain_3c63a4_idx", | ||
) | ||
], | ||
}, | ||
), | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
import hashlib | ||
import uuid | ||
|
||
from django.db import models | ||
|
||
from corehq.const import ONE_DAY | ||
from corehq.util.quickcache import quickcache | ||
|
||
|
||
class DeIdMapping(models.Model): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do I have it right that the purpose of this model is to map values to UUIDs so that a given value always has the same UUID? Could the model be eliminated by using the hash of the value + domain, possibly with a stronger hash function (SHA-256, for example)? That would make the from haslib import sha256
def get_deid(value, doc):
domain = doc["domain"] if isinstance(doc, dict) else doc.domain
data = f"{value} {domain}".encode("utf-8")
return sha256(data).hexdigest()[:32] We could also use something like a secret salt (possibly stored in settings.py, or even one per domain) to make it harder to re-identify the values, although that may be overkill. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yet another idea is to use the document ID as hash salt. |
||
domain = models.TextField(max_length=255) | ||
hashed_value = models.TextField(max_length=32) | ||
deid = models.UUIDField(default=uuid.uuid4) | ||
Comment on lines
+12
to
+13
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had trouble coming up with good names for these fields, with fairly little context on export code in general. Open to alternatives here. Maybe something closer to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we do stick with the db model, could we just make the primary key a uuid and use that instead of the deid field? (Code might not be 100% accurate) |
||
|
||
class Meta: | ||
indexes = [ | ||
models.Index(fields=['domain', 'hashed_value']), | ||
] | ||
|
||
@classmethod | ||
def get_deid(cls, value, doc): | ||
if isinstance(doc, dict): | ||
domain = doc['domain'] | ||
else: | ||
domain = doc.domain | ||
Comment on lines
+24
to
+25
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably good to have a test cover this code as well. |
||
return cls._get_deid(value, domain) | ||
|
||
@classmethod | ||
@quickcache(['value', 'domain'], timeout=90 * ONE_DAY) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 90 days is a long time. If the code behind this cache changes, it will not take effect until months later (for cached values). Can you explain more about why such a long cache timeout was chosen? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I hadn't considered code changes effectively being delayed by cache, which is a good point, but the length of time is because these values are essentially expected to stay the same forever once stored. The reason for caching here generally is so we're not making excessive DB calls for every repeated "sensitive ID" value in an export. Maybe there's a shorter timeout here that would be more convenient for consideration of working with this code in the future, but still mitigate most of the performance impact on exports. Maybe 1 month? 1 week? |
||
def _get_deid(cls, value, domain): | ||
hashed_value = cls._hash_value(value) | ||
deid_mapping, __ = cls.objects.get_or_create(domain=domain, hashed_value=hashed_value) | ||
return str(deid_mapping.deid) | ||
|
||
@staticmethod | ||
@quickcache(['value'], timeout=90 * ONE_DAY) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does the result of this function need to be cached? My thoughts: The cache adds a network hit (to redis), but the function itself does not hit the network, so the cache would most likely make the function slower. Also, MD5 hashing should be very fast unless There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I cached it here without thinking about the additional network hit, just that the return value will stay the same for a given input so we "might as well" cache it. I assume you're probably correct that simply hashing the value and returning it would be more performant than caching it here, so I'll make that change. |
||
def _hash_value(value): | ||
if value is None: | ||
# None is a de-identifiable value but needs a string to encode for lookup | ||
value = '' | ||
Comment on lines
+38
to
+40
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think there's a meaningful difference between an empty string and Previously you would get the same output for the both |
||
|
||
return hashlib.md5(value.encode('utf-8')).hexdigest() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
from django.test import TestCase | ||
|
||
from corehq.apps.export.models import DeIdMapping | ||
|
||
|
||
class TestDeIdMapping(TestCase): | ||
def test_deid_unique_by_domain(self): | ||
value = 'somedatavalue' | ||
|
||
deid_one = DeIdMapping.get_deid(value, {'domain': 'test-domain-1'}) | ||
deid_two = DeIdMapping.get_deid(value, {'domain': 'test-domain-2'}) | ||
self.assertNotEqual(deid_one, deid_two) | ||
|
||
def test_deid_consistent_for_value_and_domain(self): | ||
value = 'somedatavalue' | ||
domain = 'test-domain' | ||
|
||
deid_one = DeIdMapping.get_deid(value, {'domain': domain}) | ||
deid_two = DeIdMapping.get_deid(value, {'domain': domain}) | ||
self.assertEqual(deid_one, deid_two) | ||
|
||
def test_none_is_a_deidentifiable_value(self): | ||
value = None | ||
|
||
deid = DeIdMapping.get_deid(value, {'domain': 'test-domain'}) | ||
self.assertIsNotNone(deid) |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,9 +1,12 @@ | ||||||
import importlib | ||||||
from django.http import Http404 | ||||||
|
||||||
from couchdbkit import ResourceNotFound | ||||||
|
||||||
from corehq.apps.accounting.models import Subscription | ||||||
from corehq.apps.accounting.utils import domain_has_privilege | ||||||
from corehq.apps.export.const import ( | ||||||
DEID_DATE_TRANSFORM, DEID_ID_TRANSFORM, DEID_TRANSFORM_FUNCTIONS | ||||||
) | ||||||
from corehq.privileges import DAILY_SAVED_EXPORT, DEFAULT_EXPORT_SETTINGS, EXCEL_DASHBOARD | ||||||
from corehq.toggles import MESSAGE_LOG_METADATA | ||||||
|
||||||
|
@@ -57,10 +60,25 @@ def get_default_export_settings_if_available(domain): | |||||
""" | ||||||
Only creates settings if the domain has the DEFAULT_EXPORT_SETTINGS privilege | ||||||
""" | ||||||
from corehq.apps.accounting.models import Subscription | ||||||
settings = None | ||||||
current_subscription = Subscription.get_active_subscription_by_domain(domain) | ||||||
if current_subscription and domain_has_privilege(domain, DEFAULT_EXPORT_SETTINGS): | ||||||
from corehq.apps.export.models import DefaultExportSettings | ||||||
settings = DefaultExportSettings.objects.get_or_create(account=current_subscription.account)[0] | ||||||
|
||||||
return settings | ||||||
|
||||||
|
||||||
def get_transform_function(func_name): | ||||||
module = importlib.import_module('corehq.apps.export.transforms') | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is
Suggested change
|
||||||
return getattr(module, func_name) | ||||||
nospame marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
|
||||||
def get_deid_transform_function(func_name): | ||||||
if func_name == DEID_TRANSFORM_FUNCTIONS[DEID_DATE_TRANSFORM]: | ||||||
module = importlib.import_module('couchexport.deid') | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
return getattr(module, func_name) | ||||||
elif func_name == DEID_TRANSFORM_FUNCTIONS[DEID_ID_TRANSFORM]: | ||||||
from corehq.apps.export.models import DeIdMapping | ||||||
return DeIdMapping.get_deid | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function will return |
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -7,14 +7,18 @@ | |||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||
from casexml.apps.case.mock import CaseBlock | ||||||||||||||||||||||||||||||||||||||||||||
from casexml.apps.case.util import property_changed_in_action | ||||||||||||||||||||||||||||||||||||||||||||
from couchexport.deid import deid_date, deid_ID | ||||||||||||||||||||||||||||||||||||||||||||
from dimagi.utils.parsing import json_format_datetime | ||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.case_search.const import INDEXED_METADATA_BY_KEY | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.data_interfaces.deduplication import DEDUPE_XMLNS | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.es import filters | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.es.cases import CaseES | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.export.const import DEID_DATE_TRANSFORM, DEID_ID_TRANSFORM | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.export.const import ( | ||||||||||||||||||||||||||||||||||||||||||||
DEID_DATE_TRANSFORM, | ||||||||||||||||||||||||||||||||||||||||||||
DEID_ID_TRANSFORM, | ||||||||||||||||||||||||||||||||||||||||||||
DEID_TRANSFORM_FUNCTIONS, | ||||||||||||||||||||||||||||||||||||||||||||
) | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.export.utils import get_deid_transform_function | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.receiverwrapper.util import submit_form_locally | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.apps.users.util import SYSTEM_USER_ID | ||||||||||||||||||||||||||||||||||||||||||||
from corehq.form_processor.exceptions import CaseNotFound, MissingFormXml | ||||||||||||||||||||||||||||||||||||||||||||
|
@@ -293,9 +297,11 @@ def get_deidentified_data(case, censor_data): | |||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||
censored_value = '' | ||||||||||||||||||||||||||||||||||||||||||||
if transform == DEID_DATE_TRANSFORM: | ||||||||||||||||||||||||||||||||||||||||||||
deid_date = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_DATE_TRANSFORM]) | ||||||||||||||||||||||||||||||||||||||||||||
censored_value = deid_date(case_value, None, key=case.case_id) | ||||||||||||||||||||||||||||||||||||||||||||
if transform == DEID_ID_TRANSFORM: | ||||||||||||||||||||||||||||||||||||||||||||
censored_value = deid_ID(case_value, None) | ||||||||||||||||||||||||||||||||||||||||||||
deid_id = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_ID_TRANSFORM]) | ||||||||||||||||||||||||||||||||||||||||||||
censored_value = deid_id(case_value, case) | ||||||||||||||||||||||||||||||||||||||||||||
Comment on lines
299
to
+304
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Add error handling for de-identification function calls. The de-identification function calls lack error handling, which could lead to runtime errors if the functions fail. Add error handling: if transform == DEID_DATE_TRANSFORM:
- deid_date = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_DATE_TRANSFORM])
- censored_value = deid_date(case_value, None, key=case.case_id)
+ try:
+ deid_date = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_DATE_TRANSFORM])
+ censored_value = deid_date(case_value, None, key=case.case_id)
+ except Exception as e:
+ logger.error(f"Error applying date de-identification: {e}")
+ censored_value = ''
if transform == DEID_ID_TRANSFORM:
- deid_id = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_ID_TRANSFORM])
- censored_value = deid_id(case_value, case)
+ try:
+ deid_id = get_deid_transform_function(DEID_TRANSFORM_FUNCTIONS[DEID_ID_TRANSFORM])
+ censored_value = deid_id(case_value, case)
+ except Exception as e:
+ logger.error(f"Error applying ID de-identification: {e}")
+ censored_value = '' 📝 Committable suggestion
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||
if is_case_property: | ||||||||||||||||||||||||||||||||||||||||||||
props[attr_or_prop] = censored_value | ||||||||||||||||||||||||||||||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many of these do we expect to have? If the number grows very large, especially if there is a lot of churn, would it be worth using
BigAutoField
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a sense of how widely the de-identified data feature is used, but it seems plausible we could run up against the limit using if a regular
AutoField
. If we consider exports from our larger projects that might have several hundred thousand unique cases, if each of those has a few different de-identified fields, and a few different projects of that scale use this feature, it would add up pretty quickly, soBigAutoField
is likely worthwhile.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion of using
BigAutoField
pointed in a direction I'm not really in favor of. Sorry for the confusion. I'm strongly in favor of using a different hashing technique that doesn't require a new database model, such as using a unique salt per domain.