Merge pull request #1082 from ScilifelabDataCentre/dev
Prep before production
i-oden authored Apr 1, 2022
2 parents dae1805 + 528c675 commit d69de9d
Showing 59 changed files with 3,096 additions and 890 deletions.
63 changes: 63 additions & 0 deletions .gitpod.yml
@@ -0,0 +1,63 @@
---
# based on https://github.com/gitpod-io/template-docker-compose

# multi-repo
additionalRepositories:
  - url: https://github.com/ScilifelabDataCentre/dds_cli
    checkoutLocation: dds_cli

tasks:
  - name: Build backend and run server
    init: >
      chmod a+x /workspace/ &&
      docker compose build --pull
    command: docker compose --profile cli up
  - name: Open dds cli
    openMode: split-right
    command: >
      gp await-port 5000 &&
      echo -e "\033[1;31mUse the dds cli in this terminal window\033[0m\n\033[0;33me.g.: dds auth login\033[0m" &&
      docker exec -it dds_cli bash
ports:
  - port: 5000 # backend
    onOpen: open-preview
    visibility: public
  - port: 1080 # mailcatcher
    # Can't have more than one preview at once currently :(
    # open-browser is blocked by Chrome pop-up blocker
    onOpen: ignore
    visibility: public
  - port: 9000 # minio
    onOpen: ignore
    visibility: public
  - port: 9001 # minio
    onOpen: ignore
    visibility: public
  - port: 3306 # db
    onOpen: ignore
    visibility: public

vscode:
  extensions:
    - ms-azuretools.vscode-docker
    - ms-python.python
    - esbenp.prettier-vscode # Linting and style checking
    - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar

github:
  prebuilds:
    # enable for the default branch (defaults to true)
    master: true
    # enable for all branches in this repo (defaults to false)
    branches: false
    # enable for pull requests coming from this repo (defaults to true)
    pullRequests: true
    # enable for pull requests coming from forks (defaults to false)
    pullRequestsFromForks: true
    # add a check to pull requests (defaults to true)
    addCheck: true
    # add a "Review in Gitpod" button as a comment to pull requests (defaults to false)
    addComment: false
    # add a "Review in Gitpod" button to the pull request's description (defaults to false)
    addBadge: true
6 changes: 6 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,6 @@
{
  "editor.defaultFormatter": "esbenp.prettier-vscode",
  "editor.formatOnSave": true,
  "workbench.startupEditor": "none",
  "python.formatting.provider": "black"
}
118 changes: 117 additions & 1 deletion ADR.md
@@ -1,7 +1,11 @@
# Architecture Decision Record (ADR)

---

# 1. Framework: Flask

**Date/year:** 2019/2020

## Alternatives and comparisons

### Tornado
@@ -41,8 +45,12 @@

Since Flask is flexible and simple, its extensions provide a wide variety of functionality (including REST API support), it has an integrated testing system, and there is more online support than for Tornado, **Flask** was chosen as the better option for the Data Delivery System framework. The built-in asynchronicity of Tornado is not an important feature since the system will not be used by thousands of users at a given time.

---

# 2. Database: MariaDB

**Date/year:** 2019/2020

The initial database used during development was the non-relational CouchDB. As development progressed, it became of interest to investigate whether this was the best approach or whether a relational database was better for the system's purposes.

## Alternatives and comparisons
@@ -72,8 +80,12 @@ The main motivation behind choosing a **relational database** is that we are for

**MariaDB** was chosen (rather than e.g. MySQL) as the relational database because of its query performance and because it provides many useful features not available in other relational databases.

---

# 3. Compression algorithm: ZStandard

**Date/year:** 2019/2020

GNU zip (Gzip) is the most popular compression algorithm; it is suitable for compression of data streams, is supported by all browsers, and comes as standard in all major web servers. However, while gzip provides a good compression ratio (original/compressed size), it is very slow compared to other algorithms.

The encryption speed using ChaCha20-Poly1305 (in this case tested on a 109 MB file) is around 600 MB/s, but when adding compression as a preceding step, the speed was less than 3 MB/s and the compression ratio 3.25. Since the delivery system will be dealing with huge files, it's important that the processing is efficient, and therefore that the chosen algorithms are fast. Due to this, Zstandard was tested with the same chunk size, resulting in a speed of 119 MB/s and a compression ratio of 3.1.
@@ -82,8 +94,12 @@ The encryption speed using ChaCha20-Poly1305 (in this case tested on a 109 MB fi

Since Zstandard gave approximately the same compression ratio in a fraction of the time, **Zstandard** was chosen as the algorithm to be implemented within the Data Delivery System.
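
For illustration, a minimal sketch of streamed Zstandard compression in fixed-size chunks, using the `zstandard` Python package (not the actual DDS implementation):

```python
import zstandard as zstd

CHUNK_SIZE = 64 * 1024  # 64 KiB, see decision 6 below


def compress_file(in_path: str, out_path: str) -> None:
    """Stream-compress a file without reading it completely into memory."""
    cctx = zstd.ZstdCompressor()
    with open(in_path, "rb") as source, open(out_path, "wb") as destination:
        # copy_stream reads `read_size` bytes at a time and writes compressed output
        cctx.copy_stream(source, destination, read_size=CHUNK_SIZE)
```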

---

# 4. In-transit vs local encryption: Local (for now, looking at options)

**Date/year:** 2019/2020

The most efficient way of delivering the data to the owners would be to perform encryption and decryption in transit, thereby decreasing the amount of memory required locally and possibly the total delivery time. However, there have been some difficulties finding a working solution for this, including that the `crypt4gh` Python package used in the beginning of development did not support it.

On further investigation and contact with Safespring, we learned:
@@ -93,8 +109,12 @@
- All users of the Safespring backup service perform encryption on their own and handle the keys themselves.
- Due to this, the encryption will be performed locally before upload to the S3 storage.

---

# 5. Encryption Algorithm: ChaCha20-Poly1305

**Date/year:** 2019/2020

The new encryption format standard for genomics and health-related data is Crypt4GH, developed by the Global Alliance for Genomics and Health and first released in 2019. The general encryption standard, however, is AES, of which AES-GCM is an authenticated encryption mode. AES has been based on the block cipher Rijndael since 2001. ChaCha20 is a stream cipher and is used within the Crypt4GH format. The most secure, efficient and generally appropriate format and algorithm should be implemented within the Data Delivery System.

## Alternatives and comparisons
@@ -203,8 +223,12 @@ Files larger than 256 GiB will need to be partitioned, however the number of par

Due to this, **ChaCha20-Poly1305** was chosen as the encryption algorithm within the Data Delivery System.
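
A minimal sketch of authenticated encryption of one data chunk with ChaCha20-Poly1305, here using the `cryptography` Python package (shown only as an illustration; the actual DDS code may differ):

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = ChaCha20Poly1305.generate_key()  # 256-bit key
nonce = os.urandom(12)                 # 96-bit nonce, must never repeat for the same key
aead = ChaCha20Poly1305(key)

# Returns the ciphertext followed by the 16-byte Poly1305 MAC
ciphertext = aead.encrypt(nonce, b"one 64 KiB file chunk", None)

# Raises cryptography.exceptions.InvalidTag if the data has been tampered with
plaintext = aead.decrypt(nonce, ciphertext, None)
```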

---

# 6. File chunk size: 64 KiB

**Date/year:** 2019/2020

Compression and encryption are performed in a streamed manner to avoid reading large files completely into memory. This is done by reading the file in chunks, and the size of the chunks affects the speed of the algorithms, their memory usage, and the final size of the compressed and encrypted files.

![Zstandard testing graph](/img/zstandard.png)
@@ -215,8 +239,12 @@ To find the optimal chunk size, a 33 GB file was compressed and encrypted using

The Data Delivery System will read the files in 64 KiB chunks.
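
A minimal sketch of such chunked reading (illustrative only, not the exact DDS code):

```python
CHUNK_SIZE = 64 * 1024  # 64 KiB


def read_in_chunks(path: str, chunk_size: int = CHUNK_SIZE):
    """Yield the file contents one chunk at a time instead of reading it all into memory."""
    with open(path, "rb") as file_handle:
        while chunk := file_handle.read(chunk_size):
            yield chunk
```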

---

# 7. File integrity guarantee: Nonce incrementation

**Date/year:** 2019/2020

As described in section 5 above, the Crypt4GH format encrypts the files in blocks of 64 KiB, after which each data block's unique nonce, ciphertext and MAC are saved to the c4gh file. This guarantees the integrity of the data blocks; however, it does not guarantee the integrity of the entire file, and it is therefore possible that some blocks are rearranged, duplicated or missing without the recipient knowing. Although we have chosen not to use the Crypt4GH format within the delivery system, we do use the same encryption algorithm – ChaCha20-Poly1305 – and (since we cannot read huge files in memory) we have chosen to read the files in equally sized chunks. Therefore, the integrity issue can potentially cause major problems for the delivery system.

## Alternatives and comparisons
@@ -243,8 +271,12 @@ Due to this, no checksum verification is used during the upload. However, the fi

Nonce incrementation will be used and no checksum verification will be performed during upload.
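
The idea can be sketched as follows (a simplified illustration, not the exact DDS code): a random start nonce is chosen for the file and incremented by one for every chunk, so blocks cannot be reordered, duplicated or dropped without decryption failing.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305


def encrypt_chunks(chunks, key: bytes):
    """Encrypt chunks with a nonce that is incremented by 1 for every block."""
    aead = ChaCha20Poly1305(key)
    nonce = int.from_bytes(os.urandom(12), "big")  # start nonce, stored alongside the file
    for chunk in chunks:
        yield aead.encrypt(nonce.to_bytes(12, "big"), chunk, None)
        nonce = (nonce + 1) % (1 << 96)  # stay within the 96-bit nonce space
```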

---

# 8. Password Authentication: Argon2id

**Date/year:** 2021

Argon2 is also available in two other versions. These are argon2d (strong GPU resistance) and argon2i (resistant to side-channel attacks). Argon2id is a combination of the two and is the recommended mode.

## Decision
@@ -253,8 +285,92 @@ The Data Delivery System will use [Argon2id](https://github.com/hynek/argon2-cff

> The chosen parameters will be added here soon.
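
A minimal usage sketch with `argon2-cffi`, whose `PasswordHasher` defaults to Argon2id (the parameters below are the library defaults, not necessarily the ones chosen for DDS):

```python
from argon2 import PasswordHasher
from argon2.exceptions import VerifyMismatchError

ph = PasswordHasher()  # Argon2id with the library's default parameters

password_hash = ph.hash("correct horse battery staple")

try:
    ph.verify(password_hash, "correct horse battery staple")  # raises on mismatch
except VerifyMismatchError:
    print("Wrong password")
```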
# 9. Requirements: No pinned versions
---

# 9. User roles: Super Admin, Unit Admin, Unit Personnel, Researcher

**Date:** September 14th 2021

## Decision

### Super Admin (DC)

- Manage: Add, Remove, Edit
  - Unit (instances)
  - Users

### Unit Admin

- Unit Personnel Permissions
- Manage: Add, Add to project, Remove from project, Remove account, Change permissions
  - Unit Admin
  - Unit User

### Unit Personnel

- Project Owner Permissions
- Upload
- Delete

### Project Owner

- Research User Permissions
- Manage: Invite, Add to project, Remove from project, Remove account, Change permissions
  - Project Owners
  - Research Users

### Research Account

- Remove own account
- List
- Download

---

# 10. HOTP as default

**Date:** December 1st 2021

Initially, TOTP was implemented as the two-factor authentication method. Authenticator apps such as Authy or Google Authenticator could be set up and used to identify a user. However, due to technical difficulties for some users, it was decided that we need to allow 2FA via email as a default.

## Decision

Use email 2FA (using HOTP) as a default. 2FA with authenticator apps (with TOTP) will be implemented _at some point_ and the users will be able to choose which method they want to use.
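
As an illustration of how counter-based one-time codes work, a small sketch using the `pyotp` package (used here purely as an example, not necessarily the library used in DDS):

```python
import pyotp

secret = pyotp.random_base32()
hotp = pyotp.HOTP(secret)

code = hotp.at(0)            # code for counter value 0, e.g. sent to the user by email
assert hotp.verify(code, 0)  # the server verifies against the same counter value

# The counter must be incremented after each successful verification,
# otherwise the same code would be accepted again.
next_code = hotp.at(1)
```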

---

# 11. Structured logging for action log

**Date:** January 12th 2022

Example: https://newrelic.com/blog/how-to-relic/python-structured-logging

## Decision

Structured logging should be implemented for the action logging first - the logging which records when a user tries to perform a specific action, e.g. upload/download/list/auth/rm.

The information required to be logged: username, action, result (failed/successful), time, and the project in which the action was attempted.

When the action logs have been fixed we will discuss whether or not this will be implemented in the general logging as well, such as debugging and general system info.
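
A minimal sketch of what such an action log entry could look like with a structured logger (here `structlog`; the field values are hypothetical):

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # time of the attempted action
        structlog.processors.JSONRenderer(),          # machine-readable output
    ]
)

logger = structlog.get_logger()
logger.info(
    "action attempted",
    username="researcher_1",
    action="upload",
    result="successful",
    project="project_id",
)
```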

---

# 12. Requirements: No pinned versions

**Date:** March 1st 2022

## Decisions

We will not pin the requirement versions. If at some point something stops working, we will look into it and update the requirements then. This will simplify the installation for the users, which is one of our priorities.

---

# 13. No `--username` option

**Date:** March 2nd 2022

Previously there was a `--username` option for all commands where the user could specify the username.

## Decision

We will not have the `--username` option. When using the `dds auth login` command, either the existing encrypted token will be used or the user will be prompted to fill in their username and password.
26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -29,7 +29,7 @@ Please add a _short_ line describing the PR you make, if the PR implements a spe
- Add support for getting IPs from X-Forwarded-For ([#952](https://github.com/ScilifelabDataCentre/dds_web/pull/952))
- Relax requirements for usernames (wider length range, `.` and `-`) ([#943](https://github.com/ScilifelabDataCentre/dds_web/pull/943))
- Delay committing project to db until after the bucket has been created ([#967](https://github.com/ScilifelabDataCentre/dds_web/pull/967))
- Fix logic for notification about sent email ([#963])(https://github.com/ScilifelabDataCentre/dds_web/pull/963))
- Fix logic for notification about sent email ([#963](https://github.com/ScilifelabDataCentre/dds_web/pull/963))
- Extended the `dds_web.api.dds_decorators.logging_bind_request` decorator to catch all not yet caught exceptions and make sure they will be logged ([#958](https://github.com/ScilifelabDataCentre/dds_web/pull/958)).
- Increase the security of the session cookie using HTTPONLY and SECURE ([#972](https://github.com/ScilifelabDataCentre/dds_web/pull/972))
- Add role when listing project users ([#974](https://github.com/ScilifelabDataCentre/dds_web/pull/974))
@@ -45,3 +45,27 @@ Please add a _short_ line describing the PR you make, if the PR implements a spe
## Sprint (2022-03-09 - 2022-03-23)

- Introduce a separate error message if someone tries to add a unit user to projects individually. ([#1039](https://github.com/ScilifelabDataCentre/dds_web/pull/1039))
- Catch KeyNotFoundError when user tries to give access to a project they themselves do not have access to ([#1045](https://github.com/ScilifelabDataCentre/dds_web/pull/1045))
- Display an error message when the user makes too many authentication requests. ([#1034](https://github.com/ScilifelabDataCentre/dds_web/pull/1034))
- When listing the projects, return whether or not the user has a project key for that particular project ([#1049](https://github.com/ScilifelabDataCentre/dds_web/pull/1049))
- New endpoint for Unit Personnel and Admins to list the other Unit Personnel / Admins within their project ([#1050](https://github.com/ScilifelabDataCentre/dds_web/pull/1050))
- Make previous HOTP invalid at password reset ([#1054](https://github.com/ScilifelabDataCentre/dds_web/pull/1054))
- New PasswordReset table to keep track of when a user has requested a password reset ([#1058](https://github.com/ScilifelabDataCentre/dds_web/pull/1058))
- New endpoint for listing Units as Super Admin ([#1060](https://github.com/ScilifelabDataCentre/dds_web/pull/1060))
- New endpoint for listing unit users as Super Admin ([#1059](https://github.com/ScilifelabDataCentre/dds_web/pull/1059))
- Future-proofing the migrations ([#1040](https://github.com/ScilifelabDataCentre/dds_web/pull/1040))
- Return int instead of string from files listing and only return usage info if right role ([#1070](https://github.com/ScilifelabDataCentre/dds_web/pull/1070))
- Batch deletion of files (breaking atomicity) ([#1067](https://github.com/ScilifelabDataCentre/dds_web/pull/1067))
- Change token expiration time to 7 days (168 hours) ([#1061](https://github.com/ScilifelabDataCentre/dds_web/pull/1061))
- Add possibility of deleting invites (temporary fix in delete user endpoint) ([#1075](https://github.com/ScilifelabDataCentre/dds_web/pull/1075))
- Flask command `create-unit` to create unit without having to interact with database directly ([#1075](https://github.com/ScilifelabDataCentre/dds_web/pull/1075))
- Let project description include . and , ([#1080](https://github.com/ScilifelabDataCentre/dds_web/pull/1080))
- Catch OperationalError if there is a database malfunction in `files.py` ([#1089](https://github.com/ScilifelabDataCentre/dds_web/pull/1089))
- Switched the validation for the principal investigator from string to email ([#1084](https://github.com/ScilifelabDataCentre/dds_web/pull/1084)).

## Sprint (2022-03-23 - 2022-04-06)

- Add link in navbar to the installation documentation ([#1112](https://github.com/ScilifelabDataCentre/dds_web/pull/1112))
- Change from apscheduler to flask-apscheduler - solves the app context issue ([#1109](https://github.com/ScilifelabDataCentre/dds_web/pull/1109))
- Send an email to all Unit Admins when a Unit Admin has reset their password ([#1110](https://github.com/ScilifelabDataCentre/dds_web/pull/1110)).
- Patch: Add check for unanswered invite when creating project and adding user who is already invited ([#1117](https://github.com/ScilifelabDataCentre/dds_web/pull/1117))