Dataset encryption #423
Labels
big
A big project that would require more than trivial fixes and enhancements.
enhancement
New feature or request
questionable
Currently, datasets are stored without any form of encryption, allowing anyone with access to the file to view the data.
This is OK in many circumstances, since datasets can be made private (i.e. only available to owners). But even then people with filesystem access to the server, as well as server admins, still have access to the files.
For sensitive data this is potentially problematic. One solution is to run your own 4CAT instance, but this is not always feasible, and even then encrypted file storage might be preferred in some circumstances (e.g. because it is an organisational requirement). If we go forward with the media upload datasource (#419), people might upload sensitive data collected elsewhere (e.g. from recorded interviews), and it would be useful if secure storage could be offered for such data.
Since we already use zip files to store various types of datasets, an obvious solution would be to use encrypted zip archives. For datasets not currently stored as zip files, the various methods to access the data (`iterate_items`, etc) could be amended to transparently store in and read from encrypted zip archives.
Python's native `zipfile` does not support encrypted archives well, but for example pyzipper seems to be a robust and mostly drop-in alternative.
A question is how to handle access to the archive. To run a processor on encrypted data, the encryption key would need to be available on the server, at least temporarily. We already have some code in place to handle credentials for APIs et cetera, which are kept on disk as briefly as possible and deleted once no longer necessary. A similar compromise could be used here.
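To make the "keep the key on disk as briefly as possible" idea concrete, here is a minimal sketch using only the standard library. It is not 4CAT's existing credential code; `temporary_key_file` and `run_with_key` are hypothetical names, and the `processor` callable stands in for whatever would actually open the encrypted archive.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_key_file(key: bytes, directory=None):
    """Write `key` to a private temp file and remove it when the block exits.

    Hypothetical helper, not part of 4CAT.
    """
    # mkstemp creates the file readable and writable only by the current user
    fd, name = tempfile.mkstemp(prefix="dataset-key-", dir=directory)
    path = Path(name)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(key)
        yield path
    finally:
        # Runs even if the processor raises, so the key file does not linger
        path.unlink(missing_ok=True)


def run_with_key(key: bytes, processor):
    """Call `processor(key_bytes)` while the key file exists; return its result."""
    with temporary_key_file(key) as key_path:
        return processor(key_path.read_bytes())
```

Note that unlinking a file does not overwrite its contents on disk, so this limits how long the key is reachable through the filesystem rather than guaranteeing it is unrecoverable.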