ADR: Handling AWS-based backing service maintenance #441

risicle · 2022-05-23T12:51:56Z

What

Perhaps this shouldn't be an ADR, as it is arguably more of an RFC, but we don't have a mechanism for RFCs right now. Instead, here's a draft PR of the ADR to take a look at, and perhaps people can discuss here whether this is a good idea.

How to review

👀

Who can review

Humans

pauldougan · 2022-05-24T07:12:25Z

I have added the RFC idea to the retro board for the team to discuss today

pauldougan · 2022-05-24T07:45:15Z

source/architecture_decision_records/ADR0050_handling_aws_backing_service_maintenance.md

+ This flag could also have a third setting, which would allow the tenant to only apply "mandatory" maintenance eagerly.
+ The default value for this flag is open for discussion, though leaving eager updates "off" by default would fail to solve the bulk of the problem. "Eager maintenance" is probably most useful to tenants who aren't attentive enough to discover this flag and turn it on.
+
+As a stretch goal, we could devise a mechanism to communicate upcoming maintenance to tenants, though this has some significant caveats:


Perhaps an interface in PaaS admin to let org managers view and set maintenance windows on an org-wide basis? This would need some user research, I imagine.

Maybe in 2025.

cadmiumcat · 2022-06-10T08:40:45Z

source/architecture_decision_records/ADR0050_handling_aws_backing_service_maintenance.md

+
+ - The ability to set a service instance's maintenance window. The RDS broker already allows this.
+ - A periodic job that will trigger available maintenance for a managed service instance in its maintenance window. The Elasticache broker would need a job that ran *at least* every hour to guarantee running within any possible maintenance window (which can be minimum 1 hour wide) because Elasticache lacks the ability to set maintenance to run in the "next maintenance period". There would be advantages to using this technique in the RDS broker too - use of the "apply at next maintenance window" feature means there's a time period *between* then and the start of that maintenance period where it's too late to stop a piece of maintenance using the flag described below. Using our own broker to schedule the maintenance avoids this danger. The RDS broker already has a "cron" mechanism which it uses for deleting old snapshots. Note that the minimum maintenance window for an RDS instance is half an hour, meaning our periodic job would need to run twice as frequently.
+ - A flag on service instances that would allow the above "eager maintenance" feature to be turned off. A tenant might want to use this flag when they have a particularly busy or critical period approaching for their service. Using this would give them an "update holiday", after which the periodic job would catch up on missed maintenance. Tenants that were extremely averse to updates could leave this flag off permanently, which would result in something close to the current situation - important updates being force-applied by AWS after the maximum deferral period.
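
The periodic job described above could be sketched roughly as follows. This is only an illustration, not the brokers' actual code: the function names and the `eager_enabled` argument are hypothetical, and it assumes the maintenance window is given in the `ddd:hh24:mi-ddd:hh24:mi` UTC form that AWS uses for preferred maintenance windows.

```python
from datetime import datetime, timezone

# Day abbreviations in the order used by datetime.weekday() (Monday == 0).
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
WEEK_MINUTES = 7 * 24 * 60


def _minute_of_week(day: str, hhmm: str) -> int:
    """Convert e.g. ('sun', '23:30') to minutes elapsed since Monday 00:00."""
    hours, minutes = hhmm.split(":")
    return DAYS.index(day.lower()) * 24 * 60 + int(hours) * 60 + int(minutes)


def in_maintenance_window(window: str, now: datetime) -> bool:
    """Return True if the timezone-aware `now` falls inside `window`.

    `window` looks like 'sun:23:30-mon:00:30' (UTC). The end is exclusive,
    and windows that wrap around the end of the week are handled.
    """
    start_raw, end_raw = window.split("-")
    start_day, start_time = start_raw.split(":", 1)
    end_day, end_time = end_raw.split(":", 1)
    start = _minute_of_week(start_day, start_time)
    end = _minute_of_week(end_day, end_time)

    now = now.astimezone(timezone.utc)
    current = now.weekday() * 24 * 60 + now.hour * 60 + now.minute

    width = (end - start) % WEEK_MINUTES
    return (current - start) % WEEK_MINUTES < width


def should_apply_now(window: str, eager_enabled: bool, now: datetime) -> bool:
    """Hypothetical decision taken by the periodic job for one instance."""
    return eager_enabled and in_maintenance_window(window, now)
```

If the job runs at an interval no longer than the narrowest possible window, at least one run is guaranteed to land inside every window, which is the reasoning behind the "at least every hour" requirement for the Elasticache broker.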


I guess this means we need to let users know the "apply by" date for each update?

These are the "severity" levels of Elasticache updates:

critical: Recommended to apply immediately (within 14 days or less)

important: Recommended to apply as soon as your business flow allows (within 30 days or less)

medium: Recommended to apply within 60 days or less

low: Recommended to apply within 90 days or less

I think the "update holiday" works with important, medium and low updates. But for critical updates, should we allow this as well?
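
To make the "apply by" idea concrete, here is a minimal sketch that turns the severity levels listed above into a recommended date. It assumes the recommendation period is counted from the date the update was released, which is my assumption rather than something confirmed here; the function and dictionary names are hypothetical.

```python
from datetime import date, timedelta

# Recommended maximum number of days before applying, per severity level,
# taken from the list quoted in this comment.
RECOMMENDED_DAYS = {
    "critical": 14,
    "important": 30,
    "medium": 60,
    "low": 90,
}


def recommended_apply_by(severity: str, released: date) -> date:
    """Return the recommended 'apply by' date for an update.

    Assumption: the period starts on the update's release date.
    Raises KeyError for an unknown severity level.
    """
    return released + timedelta(days=RECOMMENDED_DAYS[severity.lower()])
```

Note that these are recommendation periods only; they say nothing about when an update would actually be force-applied.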

> I guess this means we need to let users know the "apply by" date for each update?

Not necessarily. When you go on holiday you set your answerphone, but you don't necessarily know whether anyone's going to call or not.

(granted, that example is 20 years out of date)

As far as the severity levels go, it's not terribly well-defined how they relate to updates being "mandatory" and my tendency would be to allow AWS to be the ones to decide which ones are mandatory or not, and not make any extra stipulations of our own.

Much of the idea of this flag is to allow a tenant to opt out entirely of our new eager-update feature if they find it doesn't fit their needs.

Say you're planning to have a "holiday" for a month, but there is a mandatory update due in the first week of it. What happens then?

Checking I've understood. Currently: we choose when to apply updates, which means communicating a maintenance window; this could result in some downtime, tenants react variously, and it becomes a huge support burden. Proposal: we automate maintenance by default, giving tenants the ability to override this either temporarily or permanently?

@cadmiumcat well, if it's in the first week it's likely that it got eagerly-applied before you went into holiday mode because the minimum forced-application time appears to be 14 days. If it hadn't for some reason, it would happen during the holiday, because there's nothing an RDS tenant can do to stop that.

@wryobservations what we "currently do" is quite inconsistent. We only occasionally apply a mass update because it's currently painful. Most of the time, if we're honest, we do the equivalent of having permanent holiday mode, letting only the critical updates happen, and quite late.

AP-Hunt · 2022-06-15T08:56:09Z

source/architecture_decision_records/ADR0050_handling_aws_backing_service_maintenance.md

+
+Both service types have a concept of a "maintenance window", a time period when most maintenance will be scheduled to take place. This time period is selectable in AWS, but we only currently expose this configurability to tenants in the RDS broker.
+
+Only very critical updates will be force-applied in this maintenance window. For less critical updates, we (the PaaS team) receive a notification that a new piece of maintenance needs to take place, and we can then make a choice as to when the maintenance occurs. For RDS, one of these choices is to perform it in the "next maintenance window". The problem with this though is we have over 800 databases in the London region alone, and the people who are best placed to know when's best to apply these updates are tenants. So we can find ourselves with a lot of communication to do to a lot of tenants, many of whom will have further questions or want boutique actions taken. In short, it's a support bomb. As a result, most of the lower priority updates sit in our "TODO" pile indefinitely and eventually get force-applied by AWS at the end of their acceptable deferral period. This is far from ideal, and is arguably as disruptive as applying all mandatory updates in the first available maintenance window, as maintenance packages will eventually be reaching the end of their deferrable period at approximately the same rate as new ones appear, just **many** months late. RDS also has non-mandatory maintenance updates, which in theory can be deferred indefinitely, but these don't seem to have been common lately.


Do we have any idea how many times we've had emails about security updates to RDS instances? My feeling is it's not many.

That's a good question. I can't recall whether they ever explicitly mention security in their notifications.

AP-Hunt · 2022-08-12T15:28:40Z

I've made some changes to make the ADR match our discussions. I'd appreciate some input on paragraphs 4 and 5 of the decision section.

risicle · 2022-08-15T11:01:06Z

source/architecture_decision_records/ADR0050_handling_aws_backing_service_maintenance.md

- - The more tenants see about these updates, the more likely we are to get boutique support requests for special handling of their service instance and information on whether a specific update has been applied to their service.
- - If we publish this information in a general, whole-platform way, (easy for Elasticache, much harder for RDS), we leave tenants to figure out for themselves which maintenance item applies to each of their instances.
- - If we push this information to them, per-instance, we're just passing on the noise to them that we ourselves find so hard to handle. We also don't have a great way of targeting the relevant users - we tend to email org managers, which, in large organizations, can be a very blunt tool.
+Both brokers will need a mechanism for disabling maintenance updates on service instances. Given the impending decommissioning of the platform, we feel that building this out in a self-service manner could be unnecessarily burdensome. Having a list of identifiers in a config file which we update in response to tenant tickets would likely suffice.


Alternatively, we could manually set a tag on the AWS resource, which would be easier to build out into a self-service feature if we chose to. It would also mean we didn't have to include details of specific important service IDs etc. in our (public) paas-cf deployment details.

What do you think, @alphagov/team-government-paas-people?
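
As a rough illustration of the tag-based alternative, the periodic job could check each instance's tags before triggering maintenance. This is a sketch under assumptions: the tag key `eager-maintenance` and the value `disabled` are made up for the example, and the tags are assumed to arrive as a list of `{"Key": ..., "Value": ...}` dicts, which is the shape AWS uses in tag lists for RDS and Elasticache resources.

```python
from typing import Dict, List

# Hypothetical tag key and value; nothing has been agreed on naming.
OPT_OUT_TAG_KEY = "eager-maintenance"
OPT_OUT_TAG_VALUE = "disabled"


def eager_maintenance_enabled(tags: List[Dict[str, str]]) -> bool:
    """Return False only if the instance carries the opt-out tag.

    Instances without the tag are treated as opted in, so eager
    maintenance stays on by default.
    """
    for tag in tags:
        if tag.get("Key") == OPT_OUT_TAG_KEY:
            value = tag.get("Value", "").strip().lower()
            return value != OPT_OUT_TAG_VALUE
    return True
```

Because the opt-out lives on the AWS resource itself, no instance identifiers would need to appear in the public deployment configuration.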

TMP

f5cf432

pauldougan reviewed May 24, 2022

cadmiumcat reviewed Jun 10, 2022

AP-Hunt reviewed Jun 15, 2022

Adjustments to the ADR as part of review. Squash before merging.

7ebe694

AP-Hunt force-pushed the ris-adr-aws-backing-service-maintenance branch from 30524cf to 7ebe694 Compare August 12, 2022 15:27

AP-Hunt requested review from wryobservations, pauldougan and schmie August 12, 2022 15:28

risicle commented Aug 15, 2022

AP-Hunt requested a review from nimalank7 August 17, 2022 11:38
