
A look at the risks, benefits, and tradeoffs of obfuscated data

Casey Gollan edited this page May 12, 2015 · 1 revision

Another visiting artist, Kyle McDonald, also provided lots of useful feedback and debate on Twitter. One problem Kyle pointed out is that it's not exactly straightforward for somebody to opt out of being part of our report: if one person decides they want to be anonymous, simple arithmetic or process of elimination could quickly point to their identity. Because we haven't yet been asked to opt an individual out, or to permanently scrub information from our records, we haven't yet had to grapple with the thornier aspects of this problem.

My friend David Yee also raised a useful hypothetical about maintaining our records long into the future. What if, for reasons we can't yet know because they don't yet exist, somebody in the document needs to have their identity scrubbed from it? Not only would we need to delete the information from the most current version of the document (which lives in a GitHub repository), but we would also need to search the history of the document for any related mentions and scrub those too. This is why, David told me, he is always careful about putting personally identifiable information into a version-controlled form. His suggestion was to use "unique identifiers that link to non-versioned records of consenting humans — 'Speaker A', 'Student B' — linked to a strict opt-in database of identities." Though we haven't yet implemented such a bulletproof system, David's final piece of advice, that "empathy is the secret sauce of big data," definitely informed the statement on transparency that I ended up drafting.
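David's suggestion could be sketched out in code. The following is a minimal, hypothetical illustration (we have not built this): the versioned document only ever contains stable pseudonyms like "Speaker A", while the pseudonym-to-identity mapping lives in a separate, non-versioned, opt-in store. All class and method names here are invented for the sake of the example.

```python
import string


class OptInRegistry:
    """Hypothetical non-versioned store of consenting identities.

    Only pseudonyms ever enter the version-controlled document, so
    opting someone out never requires rewriting git history.
    """

    def __init__(self, role="Speaker"):
        self.role = role
        self._labels = iter(string.ascii_uppercase)
        self._identities = {}   # pseudonym -> real name (opt-in only)
        self._pseudonyms = {}   # real name -> pseudonym

    def pseudonym_for(self, real_name):
        """Return a stable pseudonym, minting one on first use."""
        if real_name not in self._pseudonyms:
            pseudonym = f"{self.role} {next(self._labels)}"
            self._pseudonyms[real_name] = pseudonym
            self._identities[pseudonym] = real_name
        return self._pseudonyms[real_name]

    def scrub(self, real_name):
        """Opting out only touches this store; the versioned document,
        which never contained the real name, is left untouched."""
        pseudonym = self._pseudonyms.pop(real_name, None)
        if pseudonym:
            self._identities.pop(pseudonym, None)


registry = OptInRegistry()
line = f"{registry.pseudonym_for('Jane Doe')} led the workshop."
print(line)  # Only "Speaker A" ever enters version control.
registry.scrub("Jane Doe")
```

The design point is that deletion becomes a single operation against one small database, rather than a search-and-rewrite across every historical revision of the document.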

Based on these constructive criticisms and gotchas, as well as the objectives we had set forth in terms of open finances, Amit, Taeyoon, and I talked about using pseudonyms instead of real names in this report. One reason I resisted the idea of partially obfuscating the data is that I believe the more steps we put between our raw data and the public, the more opportunities we introduce for the data to be corrupted, whether by malfeasance or human error. Thinking of administration as a system we are designing, we would be weakening its trustworthiness. Perhaps if we had the time and resources, it would be interesting to create an auditable program that automates and anonymizes the release of our records more frequently than once per term, while allowing individuals to claim and reveal their own identities within the documents.
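The auditable-release idea might look something like the following sketch. It is not something we have implemented, and the record fields and names are invented for illustration: each payee in a financial record is replaced with a stable pseudonym unless that person has explicitly claimed their entries.

```python
# Hypothetical sketch: anonymize a batch of financial records before
# release, revealing a real name only for people who have opted to
# claim their identity. Field names and data are illustrative only.

def anonymize_release(records, revealed):
    """Replace each payee with a stable pseudonym unless they appear
    in `revealed`. `records` is a list of (payee, amount) pairs."""
    pseudonyms = {}
    released = []
    for payee, amount in records:
        if payee in revealed:
            label = payee
        else:
            # setdefault keeps the pseudonym stable across entries.
            label = pseudonyms.setdefault(payee, f"Person {len(pseudonyms) + 1}")
        released.append((label, amount))
    return released


records = [("Ada", 500), ("Ben", 250), ("Ada", 125)]
print(anonymize_release(records, revealed={"Ben"}))
# Ada appears as the same pseudonym in every entry; Ben, who claimed
# his identity, appears under his own name.
```

Because the program is short and deterministic, anyone could audit it and re-run it against the raw records, which speaks to the worry above about each extra processing step eroding trust.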