Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an optional limit for concurrent CreateAuditStream operations #41957

Merged
merged 4 commits into from
May 24, 2024

Conversation

espadolini
Copy link
Contributor

This PR adds an unstable envvar (TELEPORT_UNSTABLE_CREATEAUDITSTREAM_INFLIGHT_LIMIT) to limit the allowed in-flight CreateAuditStream RPCs on the Auth Service's gRPC server. Rejected streams are immediately closed with a trace.ConnectionProblemError, which results in a gRPC unavailable code. Current agents attempting to upload sessions will not even notice the error, they'll just get a (short) timeout while waiting for the upload status, and then they'll back off (linearly, from 5 to 500 seconds). The circuit breaker on the clients only checks that the stream is established correctly (which it's going to be, in the case of this sort of manual rejection of the stream), so the rest of Teleport will keep working as usual.

This change is motivated by an outage that hit a large cluster with burst activity of hundreds of thousands of SSH sessions at once - the peripheral nodes tried uploading the newly created session recordings all at once, and memory consumption skyrocketed to the point of going out of memory. With a manually configured limit, memory consumption is going to be limited.

This is going to be unsupported and undocumented (thus the TELEPORT_UNSTABLE_ envvar) until we figure out a better strategy to limit concurrent operations in general. Teleport 15+ only, as Teleport 14 control planes have to support v13 agents, which rely on the audit stream to also emit audit log events.

@espadolini espadolini added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v15 labels May 23, 2024
@espadolini espadolini requested a review from rosstimothy May 23, 2024 16:24
@github-actions github-actions bot requested review from greedy52 and hugoShaka May 23, 2024 16:24
@zmb3
Copy link
Collaborator

zmb3 commented May 23, 2024

How will this work in sync recording mode? Are we okay with preventing sessions from starting due to the limit of audit streams being open?

@espadolini
Copy link
Contributor Author

How will this work in sync recording mode?

If too many session uploads happen with no limit at the same time and all the instances of the Auth Service crash, sync recording mode is still going to be broken, right? If the cluster admin has decided that the control plane can only handle N session uploads at once, a consequence is going to be not running more than N sessions with sync recording mode.

lib/auth/grpcserver.go Outdated Show resolved Hide resolved
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from hugoShaka May 24, 2024 12:56
@espadolini espadolini force-pushed the espadolini/createauditstream-limit branch from 013af80 to dc1c97b Compare May 24, 2024 17:09
@espadolini espadolini enabled auto-merge May 24, 2024 17:16
@espadolini espadolini added this pull request to the merge queue May 24, 2024
Merged via the queue into master with commit aa26424 May 24, 2024
37 checks passed
@espadolini espadolini deleted the espadolini/createauditstream-limit branch May 24, 2024 17:44
@public-teleport-github-review-bot

@espadolini See the table below for backport results.

Branch Result
branch/v15 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/branch/v15 no-changelog Indicates that a PR does not require a changelog entry size/sm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants