Update disaster recovery guidance #198
Review app https://technical-guidance-198.test.teacherservices.cloud was deleted
011fc31 to fd37bdd
fd37bdd to a99516f
a99516f to 8322d73
8322d73 to be7cbc1
@@ -7,7 +7,7 @@ weight: 40

<%= partial('partials/page_toc') %>

This document is intended to list technical risks to our digital services and the mitigations we have in place.
This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependant services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.
This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependant services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.
This document is intended to list technical risks to our digital services and the mitigations we have in place.<br/><br/>For any issue affecting a service (or services) always [check if there are any dependent services](https://educationgovuk.sharepoint.com/sites/teacher-services-infrastructure/SitePages/Teacher-services-dependencies.aspx) that may also be affected.
fixed
|Impact|Applications may be unavailable. Data may be lost.|
|Prevention|Approved PIM request required for production Azure access.<br/>Pull Requests require at least 1 approval.<br/>Soft delete and versioning enabled for key vaults and storage accounts.<br/>Azure resource locks placed on important resources [Azure locks](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/lock-resources?tabs=json). |
|Detection|Endpoint monitoring may point to a healthcheck page that is now failing. Or smoke tests running in production may detect it.|
|Remediation|Recovery dependant on the resource deleted, either restore correct version or redeploy and restore data from backup|
|Remediation|Recovery dependant on the resource deleted, either restore correct version or redeploy and restore data from backup|
|Remediation|Recovery dependent on the resource deleted, either restore correct version or redeploy and restore data from backup|
sorry! a dependant is a person who is dependent on someone
thanks, and fixed
## Loss of Azure/AWS availability zone
We deploy to the UK South or West Europe regions which have 3 separate availability zones (AZ). It may happen that one of them is unavailable: either network, compute or storage services are affected.

|||
|-|-|
|Impact|Applications may be slow or unavailable|
|Prevention|Applications should be built with failure in mind: deploy multiple application instances and deploy databases in cluster mode. Spread them across multiple AZs for high availability.<br/>Our AKS clusters are spread across 3 AZs. Scale applications to more than 1 replicas and enable zone redundancy.|
|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZS. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|
|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZS. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|
|Prevention|Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZs. AKS Deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy.<br/>Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover).<br/>Azure key vault uses replication within region and to a paired region (automatic failover).<br/>Azure Postgres can be configured zone redundant, with the active/standby instances in different zones.<br/>Azure Redis can be zone redundant only if using the Premium SKU.<br/>Postgres and Redis utilise automatic failover. Postgres can also be failed over manually.<br/>Cluster Public IP addresses should be configured with zone redundancy [Azure PIP redundancy](https://learn.microsoft.com/en-us/azure/virtual-network/ip-services/public-ip-addresses#availability-zone)|
fixed
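For anyone unsure what the zone topology spread constraint in the Prevention row actually checks, here is a minimal Python sketch of the skew rule. It assumes a simplified model in which every zone is eligible and only pod counts matter; the real scheduler also considers node affinity, taints and the `whenUnsatisfiable` setting. The function names `zone_skew` and `allowed_zones` and the zone labels are hypothetical, not taken from this repository.

```python
from collections import Counter


def zone_skew(pod_zones, all_zones):
    """Return the gap between the fullest and the emptiest zone."""
    if not all_zones:
        return 0
    counts = Counter(pod_zones)  # missing zones count as 0
    per_zone = [counts[zone] for zone in all_zones]
    return max(per_zone) - min(per_zone)


def allowed_zones(pod_zones, all_zones, max_skew=1):
    """List the zones where one more replica keeps the skew within max_skew.

    Simplified rule: (pods already in the zone + 1) minus the smallest
    per-zone count must not exceed max_skew.
    """
    if not all_zones:
        return []
    counts = Counter(pod_zones)
    global_min = min(counts[zone] for zone in all_zones)
    return [
        zone for zone in all_zones
        if counts[zone] + 1 - global_min <= max_skew
    ]


# Hypothetical example: three zones, replicas currently in zones 1, 1 and 2.
zones = ["1", "2", "3"]
print(zone_skew(["1", "1", "2"], zones))      # 2
print(allowed_zones(["1", "1", "2"], zones))  # ['3']
```

With a maximum skew of 1, the next replica can only go to the empty zone, which is the behaviour that gives zone redundancy once an application has more than 1 replica.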
@@ -65,7 +75,7 @@ In some rare cases, an entire region might become unavailable.
|||
|-|-|
|Impact|Applications may be unavailable|
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.|
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate reqion. Any critcal application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate reqion. Any critcal application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |
|Prevention|For critical applications, it is possible to deploy to 2 different regions, synchronise the data, configure a DNS based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.<br/>Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of data in a separate region. Any critical application data kept in a storage account should be GRS/GZRS.<br/>Azure Key Vault maintains a copy of the contents in another region.<br/>For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover. |
fixed
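The ZRS/GZRS and GRS/GZRS recommendations in the two Prevention rows can be hard to keep apart, so here is a small Python sketch that classifies a storage account SKU name by the redundancy it provides. It assumes SKU names in the usual Azure form such as `Standard_GRS`; the `redundancy` helper and the example calls are hypothetical and nothing here queries Azure.

```python
# Replication suffixes that keep copies in more than one availability zone.
ZONE_REDUNDANT = {"ZRS", "GZRS", "RAGZRS"}
# Replication suffixes that keep copies in a second region.
GEO_REDUNDANT = {"GRS", "RAGRS", "GZRS", "RAGZRS"}


def redundancy(sku_name):
    """Report whether a SKU such as 'Standard_GZRS' survives zone or region loss."""
    # Take the part after the last underscore, e.g. 'GZRS' from 'Standard_GZRS'.
    suffix = sku_name.rpartition("_")[2].upper()
    return {
        "zone": suffix in ZONE_REDUNDANT,
        "region": suffix in GEO_REDUNDANT,
    }


print(redundancy("Standard_GRS"))   # {'zone': False, 'region': True}
print(redundancy("Standard_GZRS"))  # {'zone': True, 'region': True}
```

Read this way, GRS covers the loss of a region but not of a single zone, ZRS covers the reverse, and GZRS covers both.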
be7cbc1 to 3f81035
Update disaster recovery guidance
https://trello.com/c/g4YUcnbV/1860-dr-update-tech-guidance