ADR: Handling AWS-based backing service maintenance #441

Draft
wants to merge 2 commits into
base: main
from

Member

@risicle risicle commented May 23, 2022

What

Perhaps this shouldn't be an ADR as it is possibly more of a RFC, but we don't have a mechanism for RFCs right now. Instead here's a draft PR of the ADR to take a look at, and perhaps people can discuss whether this is a good idea here.

How to review

👀

Who can review

Humans

@pauldougan
Member

pauldougan commented May 24, 2022

I have added the RFC idea to the retro board for the team to discuss today

@pauldougan
Member

I'll add the RFC idea to the retro board for the team to discuss today

This flag could also have a third setting, which would allow the tenant to only apply "mandatory" maintenance eagerly.
The default value for this flag is open for discussion, though leaving eager updates "off" by default would fail to solve the bulk of the problem. "Eager maintenance" is probably most useful to tenants who aren't attentive enough to discover this flag and turn it on.

As a stretch goal, we could devise a mechanism to communicate upcoming maintenance to tenants, though this has some significant caveats:
Member

Perhaps an interface in PaaS admin to permit org managers to view and set maintenance windows on an org-wide basis? This would need some user research, I imagine.

Member Author

Maybe in 2025.


- The ability to set a service instance's maintenance window. The RDS broker already allows this.
- A periodic job that will trigger available maintenance for a managed service instance during its maintenance window. The Elasticache broker would need a job that ran *at least* every hour to guarantee running within any possible maintenance window (which can be as narrow as 1 hour), because Elasticache lacks the ability to set maintenance to run in the "next maintenance window". There would be advantages to using this technique in the RDS broker too: using the "apply at next maintenance window" feature means there's a period *between* the moment the maintenance is scheduled and the start of that maintenance window during which it's too late to stop the maintenance using the flag described below. Using our own broker to schedule the maintenance avoids this danger. The RDS broker already has a "cron" mechanism, which it uses for deleting old snapshots. Note that the minimum maintenance window for an RDS instance is half an hour, meaning our periodic job would need to run twice as frequently.
- A flag on service instances that would allow the above "eager maintenance" feature to be turned off. A tenant might want to use this flag when they have a particularly busy or critical period approaching for their service. Using this would give them an "update holiday", after which the periodic job would catch up on missed maintenance. Tenants that were extremely averse to updates could leave eager maintenance turned off permanently, which would result in something close to the current situation: important updates being force-applied by AWS after the maximum deferral period.
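
To make the periodic-job idea above concrete, here is a minimal sketch of the check such a job might run on each tick. It assumes the window is stored in the `ddd:hh24:mi-ddd:hh24:mi` UTC format that AWS uses for preferred maintenance windows; the function names and the `eager_enabled` flag are hypothetical and are not taken from either broker.

```python
from datetime import datetime, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def _minute_of_week(day: str, hhmm: str) -> int:
    """Convert e.g. ('sun', '23:30') to minutes elapsed since Monday 00:00."""
    hours, minutes = hhmm.split(":")
    return DAYS.index(day.lower()) * 24 * 60 + int(hours) * 60 + int(minutes)


def in_maintenance_window(window: str, now: datetime) -> bool:
    """Return True if `now` falls inside a window like 'sun:23:30-mon:00:30' (UTC).

    `now` is expected to be timezone-aware.
    """
    start_s, end_s = window.split("-")
    start = _minute_of_week(start_s[:3], start_s[4:])
    end = _minute_of_week(end_s[:3], end_s[4:])
    now = now.astimezone(timezone.utc)
    current = now.weekday() * 24 * 60 + now.hour * 60 + now.minute
    if start <= end:
        return start <= current < end
    # The window wraps past the end of the week (Sunday night into Monday).
    return current >= start or current < end


def should_apply_now(window: str, eager_enabled: bool, now: datetime) -> bool:
    """Hypothetical per-instance check: only act if the tenant has not opted out."""
    return eager_enabled and in_maintenance_window(window, now)
```

Because the check only succeeds while the clock is inside the window, the job's interval has to be no longer than the narrowest allowed window, which is why the Elasticache and RDS jobs would need different frequencies.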
Contributor

I guess this means we need to let users know the "apply by" date for each update?

Contributor

There are "severity" levels for Elasticache updates:

  1. critical: Recommended to apply immediately (within 14 days or less)
  2. important: Recommended to apply as soon as your business flow allows (within 30 days or less)
  3. medium: Recommended to apply within 60 days or less
  4. low: Recommended to apply within 90 days or less

I think the "update holiday" works with important, medium and low updates. But for critical updates, should we allow this as well?
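
For illustration, the recommended windows listed above could be turned into apply-by dates roughly as follows. This is only a sketch: the function name and the idea of measuring from a release date are assumptions, and these are the *recommended* periods from the list, not necessarily the dates on which AWS would force an update.

```python
from datetime import date, timedelta

# Recommended number of days to apply within, per the severity list above.
RECOMMENDED_DAYS = {"critical": 14, "important": 30, "medium": 60, "low": 90}


def recommended_apply_by(severity: str, released_on: date) -> date:
    """Hypothetical helper: latest recommended date for applying an update."""
    return released_on + timedelta(days=RECOMMENDED_DAYS[severity.lower()])
```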

Member Author

> I guess this means we need to let users know the "apply by" date for each update?

Not necessarily. When you go on holiday you set your answerphone, but you don't necessarily know whether anyone's going to call or not.

(granted, a 20-year-out-of-date example)

Member Author

As far as the severity levels go, it's not terribly well-defined how they relate to updates being "mandatory" and my tendency would be to allow AWS to be the ones to decide which ones are mandatory or not, and not make any extra stipulations of our own.

Much of the idea of this flag is to allow a tenant to opt out entirely of our new eager-update feature if they find it doesn't fit their needs.

Contributor

Say you're planning to have a "holiday" for a month, but there is a mandatory update happening on the first week of it because that's its due date - what happens then?

Checking I've understood. Currently: we choose when to apply updates, which means communicating a maintenance window; this could result in some downtime, tenants react variously, and it's a huge support burden. Proposal: we automate maintenance by default, giving tenants the ability to override this as and when needed, or permanently?

Member Author

@risicle risicle Jun 14, 2022

@cadmiumcat well, if it's in the first week it's likely that it got eagerly-applied before you went into holiday mode because the minimum forced-application time appears to be 14 days. If it hadn't for some reason, it would happen during the holiday, because there's nothing an RDS tenant can do to stop that.

@wryobservations what we "currently do" is quite inconsistent. We only occasionally apply a mass update because it's currently painful. Most of the time, if we're honest, we do the equivalent of having permanent holiday mode, letting only the critical updates happen quite late.
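
The "holiday" scenario discussed above can be modelled with a rough sketch like the one below. It works at day granularity and assumes that eager maintenance applies an update on the day it becomes available (in practice it would wait for the next maintenance window), and that the periodic job catches up the day after the holiday ends. All names are hypothetical.

```python
from datetime import date, timedelta


def expected_application_date(
    available_on: date,
    forced_on: date,
    holiday_start: date,
    holiday_end: date,
) -> date:
    """Estimate when an update is applied, given an 'update holiday'.

    available_on: day the update first becomes available.
    forced_on: day AWS would force-apply it regardless of our flag.
    """
    if not (holiday_start <= available_on <= holiday_end):
        # Outside the holiday, eager maintenance picks the update up straight away.
        return available_on
    # During the holiday the periodic job skips this instance, so the update
    # waits for whichever comes first: AWS forcing it, or the catch-up run.
    catch_up_on = holiday_end + timedelta(days=1)
    return min(forced_on, catch_up_on)
```

With a holiday covering June and an update that is due to be forced on 7 June, a 14-day minimum deferral means the update became available by 24 May, so this model has it applied before the holiday starts.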


Both service types have a concept of a "maintenance window", a time period when most maintenance will be scheduled to take place. This time period is selectable in AWS, but we only currently expose this configurability to tenants in the RDS broker.

Only very critical updates will be force-applied in this maintenance window. For less critical updates, we (the PaaS team) receive a notification that a new piece of maintenance needs to take place, and we can then make a choice as to when the maintenance occurs. For RDS, one of these choices is to perform it in the "next maintenance window". The problem with this, though, is that we have over 800 databases in the London region alone, and the people who are best placed to know when's best to apply these updates are tenants. So we can find ourselves with a lot of communication to do to a lot of tenants, many of whom will have further questions or want boutique actions taken. In short, it's a support bomb. As a result, most of the lower priority updates sit in our "TODO" pile indefinitely and eventually get force-applied by AWS at the end of their acceptable deferral period. This is far from ideal, and is arguably as disruptive as applying all mandatory updates in the first available maintenance window, as maintenance packages will eventually be reaching the end of their deferrable period at approximately the same rate as new ones appear, just **many** months late. RDS also has non-mandatory maintenance updates, which in theory can be deferred indefinitely, but these don't seem to have been common lately.
Member

Do we have any idea how many times we've had emails about security updates to RDS instances? My feeling is it's not many.

Member Author

That's a good question. I can't recall if they ever explicitly mention security in their notifications.

@AP-Hunt AP-Hunt force-pushed the ris-adr-aws-backing-service-maintenance branch from 30524cf to 7ebe694 August 12, 2022 15:27
@AP-Hunt
Member

AP-Hunt commented Aug 12, 2022

I've made some changes to make the ADR match our discussions. I'd appreciate some input on paragraphs 4 and 5 of the decision section.

- The more tenants see about these updates, the more likely we are to get boutique support requests for special handling of their service instance and information on whether a specific update has been applied to their service.
- If we publish this information in a general, whole-platform way, (easy for Elasticache, much harder for RDS), we leave tenants to figure out for themselves which maintenance item applies to each of their instances.
- If we push this information to them, per-instance, we're just passing on the noise to them that we ourselves find so hard to handle. We also don't have a great way of targeting the relevant users - we tend to email org managers, which, in large organizations, can be a very blunt tool.
Both brokers will need a mechanism for disabling maintenance updates on service instances. Given the impending decommissioning of the platform, we feel that building this out in a self-service manner could be unnecessarily burdensome. Having a list of identifiers in a config file which we update in response to tenant tickets would likely suffice.
Member Author

Alternatively, we could manually set a tag on the AWS resource, which would be easier to build out into a self-service feature if we chose to. It would also mean we didn't have to include details of specific important service IDs etc. in our (public) paas-cf deployment details.
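
A rough sketch of how the two opt-out mechanisms discussed here (a config-file list of identifiers, or a tag on the AWS resource) might be checked side by side. The tag key `eager-maintenance` and value `disabled` are made up for illustration, as is the function name; the only assumption about AWS is that tags arrive as a list of `{"Key": ..., "Value": ...}` entries.

```python
from typing import Iterable, Mapping

OPT_OUT_TAG_KEY = "eager-maintenance"  # hypothetical tag key
OPT_OUT_TAG_VALUE = "disabled"         # hypothetical tag value


def eager_maintenance_enabled(
    instance_id: str,
    tags: Iterable[Mapping[str, str]],
    config_opt_outs: Iterable[str] = (),
) -> bool:
    """Return False if the instance is opted out by config list or by tag."""
    if instance_id in set(config_opt_outs):
        return False
    for tag in tags:
        key = tag.get("Key")
        value = tag.get("Value", "").lower()
        if key == OPT_OUT_TAG_KEY and value == OPT_OUT_TAG_VALUE:
            return False
    return True
```

The tag route keeps instance identifiers out of the deployment repository, since the opt-out lives on the resource itself.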

Member

What do you think, @alphagov/team-government-paas-people?

@AP-Hunt AP-Hunt requested a review from nimalank7 August 17, 2022 11:38
5 participants