ADR: Handling AWS-based backing service maintenance #441
I have added the RFC idea to the retro board for the team to discuss today
I'll add the RFC idea to the retro board for the team to discuss today
This flag could also have a third setting, which would allow the tenant to only apply "mandatory" maintenance eagerly.
The default value for this flag is open for discussion, though leaving eager updates "off" by default would fail to solve the bulk of the problem. "Eager maintenance" is probably most useful to tenants who aren't attentive enough to discover this flag and turn it on.

As a stretch goal, we could devise a mechanism to communicate upcoming maintenance to tenants, though this has some significant caveats:
perhaps an interface in PaaS admin to permit org managers to view and set maintenance windows on an org wide basis? This would need some user research I imagine.
Maybe in 2025.
source/architecture_decision_records/ADR0050_handling_aws_backing_service_maintenance.md
- The ability to set a service instance's maintenance window. The RDS broker already allows this.
- A periodic job that will trigger available maintenance for a managed service instance in its maintenance window. The Elasticache broker would need a job that ran *at least* every hour to guarantee running within any possible maintenance window (which can be as narrow as 1 hour) because Elasticache lacks the ability to set maintenance to run in the "next maintenance period". There would be advantages to using this technique in the RDS broker too - use of the "apply at next maintenance window" feature means there's a time period *between* then and the start of that maintenance period where it's too late to stop a piece of maintenance using the flag described below. Using our own broker to schedule the maintenance avoids this danger. The RDS broker already has a "cron" mechanism which it uses for deleting old snapshots. Note that the minimum maintenance window for an RDS instance is half an hour, meaning our periodic job would need to run twice as frequently.
- A flag on service instances that would allow the above "eager maintenance" feature to be turned off. A tenant might want to use this flag when they have a particularly busy or critical period approaching for their service. Using this would give them an "update holiday", after which the periodic job would catch up on missed maintenance. Tenants that were extremely averse to updates could leave eager maintenance turned off permanently, which would result in something close to the current situation - important updates being force-applied by AWS after the maximum deferral period.
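To make the periodic-job idea above concrete, here is a minimal sketch of the check such a job might run on each tick. It assumes the maintenance window is stored in the AWS `ddd:hh24:mi-ddd:hh24:mi` UTC format (for example `sun:23:00-mon:00:30`); the function names and the `eager_enabled` parameter are hypothetical and not taken from either broker.

```python
from datetime import datetime, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
WEEK_MINUTES = 7 * 24 * 60


def parse_window(window):
    """Return (start, end) as minutes since Monday 00:00 UTC."""
    def to_minutes(part):
        day, hour, minute = part.split(":")
        return DAYS.index(day) * 24 * 60 + int(hour) * 60 + int(minute)

    start, end = window.lower().split("-")
    return to_minutes(start), to_minutes(end)


def in_maintenance_window(window, now):
    """True if the timezone-aware datetime `now` falls inside the weekly window."""
    start, end = parse_window(window)
    now = now.astimezone(timezone.utc)
    current = now.weekday() * 24 * 60 + now.hour * 60 + now.minute
    # Work modulo one week so windows that wrap past Sunday midnight are handled.
    length = (end - start) % WEEK_MINUTES
    return (current - start) % WEEK_MINUTES < length


def should_apply_maintenance(window, now, eager_enabled):
    """Hypothetical per-instance decision made by the periodic job."""
    return eager_enabled and in_maintenance_window(window, now)
```

Because the check is only true while the window is open, the job has to tick at least once per minimum window length (hourly for Elasticache, half-hourly for RDS) to be sure of landing inside it.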
I guess this means we need to let users know the "apply by" date for each update?
There are "severity" levels for Elasticache updates:
- critical: Recommended to apply immediately (within 14 days or less)
- important: Recommended to apply as soon as your business flow allows (within 30 days or less)
- medium: Recommended to apply within 60 days or less
- low: Recommended to apply within 90 days or less
I think the "update holiday" works with `important`, `medium` and `low` updates. But for `critical` updates, should we allow this as well?
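As a rough illustration of the question, the recommended periods listed above can be turned into a date check. This is only a sketch: measuring from a release date is an assumption, the periods are recommendations rather than enforced deadlines, and all names here are hypothetical.

```python
from datetime import date, timedelta

# Recommended "apply within" periods, in days, from the severity list above.
RECOMMENDED_APPLY_WITHIN_DAYS = {
    "critical": 14,
    "important": 30,
    "medium": 60,
    "low": 90,
}


def recommended_apply_by(severity, released_on):
    """Date by which an update of this severity is recommended to be applied."""
    return released_on + timedelta(days=RECOMMENDED_APPLY_WITHIN_DAYS[severity.lower()])


def deadline_falls_in_holiday(severity, released_on, holiday_start, holiday_end):
    """True if the recommended apply-by date lands inside an "update holiday"."""
    deadline = recommended_apply_by(severity, released_on)
    return holiday_start <= deadline <= holiday_end
```

With a month-long holiday, a `critical` update released shortly beforehand is the case most likely to hit its recommended date mid-holiday, which is why it is the one in question.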
> I guess this means we need to let users know the "apply by" date for each update?

Not necessarily. When you go on holiday you set your answerphone, but you don't necessarily know whether anyone's going to call or not.
(granted, 20 year out of date example)
As far as the severity levels go, it's not terribly well-defined how they relate to updates being "mandatory" and my tendency would be to allow AWS to be the ones to decide which ones are mandatory or not, and not make any extra stipulations of our own.
Much of the idea of this flag is to allow a tenant to opt out entirely of our new eager-update feature if they find it doesn't fit their needs.
Say you're planning to have a "holiday" for a month, but there is a mandatory update happening on the first week of it because that's its due date - what happens then?
Checking I've understood - currently we choose to apply updates, thus communicating a maintenance window, could result in some downtime, tenants react variously, huge support burden. Proposal - we automate maintenance as default giving tenants the ability to override this as and when or permanently?
@cadmiumcat well, if it's in the first week it's likely that it got eagerly-applied before you went into holiday mode because the minimum forced-application time appears to be 14 days. If it hadn't for some reason, it would happen during the holiday, because there's nothing an RDS tenant can do to stop that.
@wryobservations what we "currently do" is quite inconsistent. We only occasionally apply a mass update because it's currently painful. Most of the time, if we're honest, we do the equivalent of having permanent holiday mode, letting only the critical updates happen quite late.
Both service types have a concept of a "maintenance window", a time period when most maintenance will be scheduled to take place. This time period is selectable in AWS, but we only currently expose this configurability to tenants in the RDS broker.

Only very critical updates will be force-applied in this maintenance window. For less critical updates, we (the PaaS team) receive a notification that a new piece of maintenance needs to take place, and we can then make a choice as to when the maintenance occurs. For RDS, one of these choices is to perform it in the "next maintenance window". The problem with this though is we have over 800 databases in the London region alone, and the people who are best placed to know when's best to apply these updates are tenants. So we can find ourselves with a lot of communication to do to a lot of tenants, many of whom will have further questions or want boutique actions taken. In short, it's a support bomb. As a result, most of the lower priority updates sit in our "TODO" pile indefinitely and eventually get force-applied by AWS at the end of their acceptable deferral period. This is far from ideal, and is arguably as disruptive as applying all mandatory updates in the first available maintenance window, as maintenance packages will eventually be reaching the end of their deferrable period at approximately the same rate as new ones appear, just **many** months late. RDS also has non-mandatory maintenance updates, which in theory can be deferred indefinitely, but these don't seem to have been common lately.
Do we have any idea how many times we've had emails about security updates to RDS instances? My feeling is it's not many.
That's a good question. I can't recall if they ever explicitly mention security in their notifications.
30524cf to 7ebe694
I've made some changes to make the ADR match our discussions. I'd appreciate some input on paragraphs 4 and 5 of the decision section.
- The more tenants see about these updates, the more likely we are to get boutique support requests for special handling of their service instance and information on whether a specific update has been applied to their service.
- If we publish this information in a general, whole-platform way, (easy for Elasticache, much harder for RDS), we leave tenants to figure out for themselves which maintenance item applies to each of their instances.
- If we push this information to them, per-instance, we're just passing on the noise to them that we ourselves find so hard to handle. We also don't have a great way of targeting the relevant users - we tend to email org managers, which, in large organizations, can be a very blunt tool.
Both brokers will need a mechanism for disabling maintenance updates on service instances. Given the impending decommissioning of the platform, we feel that building this out in a self-service manner could be unnecessarily burdensome. Having a list of identifiers in a config file which we update in response to tenant tickets would likely suffice.
Alternatively we could manually set a tag on the AWS resource, which would be easier to build-out to a self-service feature if we chose to. And it would mean we didn't have to include details of specific important service ids etc in our (public) paas-cf deployment details.
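The two options discussed here (a config-file list of identifiers, or a tag on the AWS resource) could both feed the same check in the broker's periodic job. A minimal sketch, assuming the instance's tags have already been fetched into a dict; the tag key `eager-maintenance`, its `disabled` value and the function name are hypothetical.

```python
def maintenance_disabled(instance_id, tags, opted_out_ids,
                         tag_key="eager-maintenance"):
    """Return True if eager maintenance should be skipped for this instance.

    instance_id   -- identifier of the service instance
    tags          -- dict of tag key/value pairs already read from the AWS resource
    opted_out_ids -- set of identifiers loaded from a config file
    tag_key       -- hypothetical tag name; not an existing broker convention
    """
    # Option 1: the instance appears in the operator-maintained config list.
    if instance_id in opted_out_ids:
        return True
    # Option 2: the AWS resource carries an opt-out tag.
    return tags.get(tag_key, "").lower() == "disabled"
```

Keeping the check in one place would let the config list be dropped later if the tag approach were built out into a self-service feature.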
What do you think, @alphagov/team-government-paas-people?
What
Perhaps this shouldn't be an ADR as it is possibly more of a RFC, but we don't have a mechanism for RFCs right now. Instead here's a draft PR of the ADR to take a look at, and perhaps people can discuss whether this is a good idea here.
How to review
👀
Who can review
Humans