This document explains the high-level architecture of the Trusted Research Environment that is deployed on AWS Cloud by following the installation steps in this repository.
The TREEHOOSE solution is formed of Service Workbench on AWS and a data lake that together provide the building blocks for the Trusted Research Environment (TRE) capability. AWS Control Tower provides the scalable multi-account setup for managing TRE implementations at scale in AWS Cloud.
In addition to the basic building blocks, the TREEHOOSE solution provides optional add-on components to enable:
- Data egress
- Workspace backups
- Budget controls
TREEHOOSE is the Trusted Research Environment (TRE) implementation that will be deployed for each research project. Deploying the solution with the default parameters builds the following environment in AWS Cloud.
The solution uses Infrastructure as Code for deployment. Later sections in this document describe each component in more detail. Below is a brief explanation of the numbered steps in the diagram.
- TRE Data Managers use the AWS Management Console to upload data to the TRE Data Lake to be used for research.
- IT Administrators use the Service Workbench web application to administer resources in the TRE environment.
- The budget controls component is used to set budget limits for the TRE project. IT Administrators can set the budget and any actions to be taken when the budget thresholds are breached.
- Backup functionality for research workspaces can also be enabled. IT Administrators can monitor these backups through AWS Backup.
- Data Managers and IT Administrators can work together to provide researchers with access to relevant data sets from the data lake.
- Researchers can create and connect to approved workspaces through the Service Workbench web application. They get secure access to compute resources using Amazon AppStream 2.0.
- On research completion the researcher can request egress of research results.
- The egress request is processed through the Data Egress App add-on, which applies a review process involving multiple approvers before the data is made available for download.
- Egress requests that are approved can be downloaded by Data Egress Managers and shared with the Researcher who requested the data egress. There is a configurable limit to the number of downloads which can be made.
- Audit & Compliance teams get full visibility into all user activities that result in AWS API calls through centralised CloudTrail logs. Additionally, they get break-glass access to all TRE project accounts through a Lambda function role in the Audit account.
Using the TREEHOOSE implementation allows a user to run multiple isolated TRE projects in parallel and to scale according to the organisation's research needs.
The TREEHOOSE TRE implementation supports scalable research workloads, aims to meet an organization's security and auditing requirements, and can evolve with business demands. To meet this goal, AWS Control Tower provides the setup to govern a secure, multi-account AWS environment, called a landing zone.
Below is the high-level Organizational Unit and Account Structure that will be set up by using the TREEHOOSE solution.
Service Workbench on AWS is a cloud solution that enables IT teams to provide secure, repeatable, and federated control of access to data, tooling, and compute power that researchers need. Find more details here.
Key Components:
- For the UI: AWS Lambda, AWS Lambda@Edge, Amazon CloudFront, Amazon S3. AWS SSO can be used for Single Sign-On (optional).
- For the backend: Amazon API Gateway, AWS Lambda, AWS Step Functions, AWS Service Catalog, Amazon DynamoDB, Amazon Cognito, Amazon S3.
- For research environments: AWS Service Catalog and AWS CloudFormation for deploying the environments; Amazon EC2, Amazon SageMaker, Amazon EMR, Amazon S3, ... (more services as desired; this is customisable by providing Service Catalog templates).
- For the secure access environment: Amazon AppStream 2.0
TREEHOOSE uses a data lake setup that leverages AWS Lake Formation to create a secure and scalable data store for research data. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The solution creates a pre-configured data lake to be used for TRE data pipelines. This is a mandatory add-on.
Key Components:
- AWS Lake Formation, Amazon S3, AWS KMS, AWS Glue, Amazon Athena
This add-on provides a data egress approval workflow that lets researchers take data out of the TRE with the permission of multiple parties (data manager, research IT, etc.). The add-on is hosted as a web application supported by backend infrastructure. Each add-on installation is tied to a specific TRE project.
The add-on provides a streamlined process for securely egressing data from the TRE while keeping the TRE administrators and data auditors in complete control of the process.
All data egress requests and any actions performed on those are recorded for Audit.
Key Components:
- For the UI: AWS Amplify
- For the backend: AWS Step Functions, Amazon EFS, AWS Lambda, Amazon DynamoDB, Amazon SES, Amazon S3, AWS KMS, Amazon SNS, Amazon Cognito, AWS AppSync
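The approval workflow described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the add-on's actual implementation (which runs on the AWS services listed above): the class, the role names, and the rule that every listed approver must approve are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class EgressRequest:
    """Toy model of one egress request; all names are hypothetical."""

    requester: str
    required_approvers: frozenset  # roles that must all approve (assumption)
    max_downloads: int             # stands in for the configurable download limit
    approvals: set = field(default_factory=set)
    downloads: int = 0
    status: Status = Status.PENDING
    audit_log: list = field(default_factory=list)

    def _record(self, actor, action):
        # Every successful action on the request is kept for audit.
        self.audit_log.append((actor, action))

    def approve(self, role):
        if self.status is not Status.PENDING:
            raise ValueError("request is no longer pending")
        if role not in self.required_approvers:
            raise PermissionError(f"{role} is not an approver")
        self.approvals.add(role)
        self._record(role, "approve")
        # The request is approved only once every required role has approved.
        if self.approvals == self.required_approvers:
            self.status = Status.APPROVED

    def reject(self, role):
        if self.status is not Status.PENDING:
            raise ValueError("request is no longer pending")
        if role not in self.required_approvers:
            raise PermissionError(f"{role} is not an approver")
        self.status = Status.REJECTED
        self._record(role, "reject")

    def download(self, actor):
        if self.status is not Status.APPROVED:
            raise PermissionError("request has not been approved")
        if self.downloads >= self.max_downloads:
            raise PermissionError("download limit reached")
        self.downloads += 1
        self._record(actor, "download")
```

In this model a request with approvers `{"data_manager", "research_it"}` stays pending after the first approval, becomes approved after the second, and refuses further downloads once `max_downloads` is reached.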
This add-on provides the capability to periodically back up researcher workspaces, so that persistent data is recoverable in case a researcher workspace is terminated by mistake.
Once implemented, this capability lets researchers choose whether to enable periodic workspace backups when creating a workspace.
Only TRE administrators can control the backup frequency and retention periods. Also, any restore operations need to be performed by admins.
This component uses:
- AWS Backup for backing up block storage attached to Amazon EC2 based compute workspaces
- a bespoke implementation to back up Amazon SageMaker Notebook Instances
The diagrams below explain how the backup solution works for:
- EC2 based workspaces
- SageMaker notebook based workspaces
Key Components:
- For the backend: AWS Step Functions, AWS Lambda, Amazon CloudWatch Events, AWS CloudFormation, AWS Backup, Amazon S3
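To make the administrator-controlled backup frequency and retention period concrete, the sketch below builds the `BackupPlan` structure that the AWS Backup `CreateBackupPlan` API accepts. It is an illustration only: TREEHOOSE deploys its resources through Infrastructure as Code, and the helper names, plan name, vault name, schedule, and retention values here are assumptions rather than values taken from this repository.

```python
def build_backup_plan(plan_name, vault_name, schedule_cron, retention_days):
    """Build a BackupPlan dict for AWS Backup's CreateBackupPlan API.

    schedule_cron controls the backup frequency and retention_days controls
    how long recovery points are kept. All argument values are illustrative.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": f"{plan_name}-rule",
                "TargetBackupVaultName": vault_name,
                "ScheduleExpression": schedule_cron,
                "Lifecycle": {"DeleteAfterDays": retention_days},
            }
        ],
    }


def create_backup_plan(plan):
    """Submit the plan to AWS Backup; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("backup").create_backup_plan(BackupPlan=plan)


# Hypothetical example: daily backup at 05:00 UTC, kept for 35 days.
example_plan = build_backup_plan(
    "tre-workspace-backup", "tre-backup-vault", "cron(0 5 ? * * *)", 35
)
```

Keeping the plan builder separate from the API call means the frequency and retention settings can be reviewed or tested without touching an AWS account.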
Budget controls is an optional component that helps administrators and finance stakeholders of the TRE stay on top of project finances. It can be deployed for each TRE project and allows them to:
- Monitor: set thresholds for sending budget alerts
- Report: send notifications on budget usage
- Respond: automate actions to avoid overspending
The component uses AWS Budgets to plan and set expectations around TRE project costs.
Key Components:
- For the backend: AWS Budgets, Amazon SNS, AWS IAM
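As a rough illustration of the monitor and report functions, the sketch below assembles the parameters for the AWS Budgets `CreateBudget` API: a monthly cost budget with one alert threshold that notifies an Amazon SNS topic. The helper names, account ID, budget name, limit, threshold, and topic ARN are all hypothetical, and the component in this repository may configure its budgets differently. Automated responses (budget actions) are not covered here.

```python
def build_budget_request(account_id, budget_name, monthly_limit_usd,
                         alert_percent, sns_topic_arn):
    """Build keyword arguments for the AWS Budgets CreateBudget API.

    Creates a monthly cost budget that sends a notification to an SNS topic
    once actual spend exceeds alert_percent of the limit.
    """
    if not 0 < alert_percent:
        raise ValueError("alert_percent must be positive")
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": budget_name,
            # The API expects the amount as a string.
            "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": float(alert_percent),
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "SNS", "Address": sns_topic_arn}
                ],
            }
        ],
    }


def create_budget(request):
    """Submit the request to AWS Budgets; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("budgets").create_budget(**request)


# Hypothetical example: 500 USD per month with an alert at 80% of actual spend.
example_request = build_budget_request(
    "111122223333", "tre-project-budget", 500, 80,
    "arn:aws:sns:eu-west-2:111122223333:tre-budget-alerts",
)
```

Additional thresholds can be expressed by appending further entries to `NotificationsWithSubscribers`.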
- https://docs.aws.amazon.com/organizations/latest/userguide/orgs_getting-started_concepts.html
- https://docs.aws.amazon.com/prescriptive-guidance/latest/designing-control-tower-landing-zone/account-structure-ous.html
- https://aws.amazon.com/government-education/research-and-technical-computing/service-workbench/
- https://docs.aws.amazon.com/aws-backup/latest/devguide/how-it-works.html