Dataset encryption #423

Open
stijn-uva opened this issue Apr 8, 2024 · 2 comments
Labels: big (A big project that would require more than trivial fixes and enhancements), enhancement (New feature or request), questionable

Comments

@stijn-uva
Member

Currently, datasets are stored without any form of encryption, allowing anyone with access to the file to view the data.

This is OK in many circumstances, since datasets can be made private (i.e. only available to owners). But even then people with filesystem access to the server, as well as server admins, still have access to the files.

For sensitive data this is potentially problematic. The solution is to run your own 4CAT, but this is not always feasible, and even then in some circumstances encrypted file storage might be preferred (because this is an organisational requirement, etc). If we go forward with the media upload datasource (#419) people might upload sensitive data collected elsewhere (e.g. from recorded interviews), and it would be useful if secure storage could be offered for such data.

Since we already use zip files to store various types of datasets, an obvious solution would be to use encrypted zip archives. For datasets not currently stored as zip files the various methods to access the data (iterate_items, etc) could be amended to transparently store in and read from encrypted zip archives.

Python's native zipfile module can only decrypt archives that use the legacy ZIP encryption scheme; it cannot create encrypted archives and has no AES support. pyzipper, for example, seems to be a robust and mostly drop-in alternative.
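To illustrate the limitation of the standard library, here is a minimal sketch using only `zipfile`. The function names and the member name `dataset.ndjson` are made up for this example and are not part of 4CAT. It shows that calling `setpassword()` before writing has no effect on the written archive: the member is stored unencrypted and can be read back without any password.

```python
import io
import zipfile


def write_with_stdlib_password(payload: bytes, password: bytes) -> bytes:
    """Write one member to an in-memory zip after calling setpassword().

    Hypothetical helper for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # setpassword() only sets the default password used when *reading*
        # encrypted members; it does not encrypt anything that is written.
        archive.setpassword(password)
        archive.writestr("dataset.ndjson", payload)
    return buffer.getvalue()


def is_member_encrypted(raw_zip: bytes, name: str) -> bool:
    """Check bit 0 of the general-purpose flags, which marks encryption."""
    with zipfile.ZipFile(io.BytesIO(raw_zip)) as archive:
        return bool(archive.getinfo(name).flag_bits & 0x1)


def read_without_password(raw_zip: bytes, name: str) -> bytes:
    """Read a member without supplying any password."""
    with zipfile.ZipFile(io.BytesIO(raw_zip)) as archive:
        return archive.read(name)


if __name__ == "__main__":
    raw = write_with_stdlib_password(b'{"id": 1}\n', b"not-actually-used")
    print(is_member_encrypted(raw, "dataset.ndjson"))
    print(read_without_password(raw, "dataset.ndjson"))
```

So any encrypted storage would need a third-party library for the writing side, whichever one ends up being chosen.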

An open question is how to handle access to the archive. To run a processor on encrypted data, the encryption key would need to be available on the server, at least temporarily. We already have some code in place to handle credentials for APIs et cetera, which are kept on disk as briefly as possible and deleted once no longer necessary. A similar compromise could be used here.
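As a rough sketch of the "on disk as briefly as possible" idea, a context manager could write the key to a private temporary file and remove it as soon as the work is done, even if the work fails. This is not 4CAT's existing credential code; `transient_key_file` and the file prefix are hypothetical names chosen for this example.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def transient_key_file(key: bytes, directory=None):
    """Yield the path of a short-lived file holding `key` (hypothetical helper).

    The file is removed when the block exits, whether or not an error occurred.
    """
    # mkstemp() creates the file so that only the creating user can
    # read or write it.
    fd, name = tempfile.mkstemp(prefix="dataset-key-", dir=directory)
    path = Path(name)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(key)
        yield path
    finally:
        # unlink() removes the directory entry; it does not overwrite the
        # underlying disk blocks, so this limits exposure rather than
        # guaranteeing secure erasure.
        path.unlink(missing_ok=True)


if __name__ == "__main__":
    with transient_key_file(b"\x00" * 32) as key_path:
        # A processor would read the key here and decrypt the dataset.
        print(key_path.exists())
    print(key_path.exists())
```

Keeping the key in memory only would avoid the file altogether, but a file may be needed if the key has to be passed from the web front end to a separate worker process.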

@sal-uva
Collaborator

sal-uva commented Apr 8, 2024

I'd be all for this! It would allow 4CAT to be used in many different research contexts.

I guess the last point could be handled with 'dataset passwords' that let you both access and decrypt a dataset? These could be stored as a cookie and deleted from the server after iterating over the dataset?

@stijn-uva
Member Author

stijn-uva commented Apr 8, 2024

Yes, we currently have the sensitive and cache options for processor input fields, which together make 4CAT handle input this way:

# now the parameters have been loaded into memory, clear any sensitive

// Cache cacheable values
