Add an optional limit for concurrent CreateAuditStream operations
#41957
This PR adds an unstable envvar (`TELEPORT_UNSTABLE_CREATEAUDITSTREAM_INFLIGHT_LIMIT`) to limit the number of in-flight `CreateAuditStream` RPCs allowed on the Auth Service's gRPC server. Rejected streams are immediately closed with a `trace.ConnectionProblemError`, which results in a gRPC `unavailable` code. Current agents attempting to upload sessions will not even notice the error: they'll just get a (short) timeout while waiting for the upload status, and then they'll back off (linearly, from 5 to 500 seconds). The circuit breaker on the clients only checks that the stream is established correctly (which it will be, in the case of this sort of manual rejection of the stream), so the rest of Teleport will keep working as usual.

This change is motivated by an outage that hit a large cluster with a burst of hundreds of thousands of SSH sessions at once: the peripheral nodes tried to upload the newly created session recordings all at the same time, and memory consumption skyrocketed to the point of running out of memory. With a manually configured limit, memory consumption is going to be bounded.
This is going to be unsupported and undocumented (thus the `TELEPORT_UNSTABLE_` envvar) until we figure out a better strategy to limit concurrent operations in general. Teleport 15+ only, as Teleport 14 control planes have to support v13 agents, which rely on the audit stream to also emit audit log events.