kv: ability to gracefully recover from large intent buildup #135934
Labels
A-admission-control
A-kv-transactions
Relating to MVCC and the transactional model.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-1
Issues/test failures with a fix SLA of 1 month
T-kv
KV Team
Is your feature request related to a problem? Please describe.
In a customer case, we saw almost 10 billion intents created across multiple ranges over a 5-hour window by a single `INSERT INTO ... SELECT FROM` statement that was ultimately killed. The MVCC GC queue ended up causing severe LSM inversion, which ultimately caused a 30+ minute outage on the cluster. We need the ability to recover gracefully when we detect this type of problem.
Describe the solution you'd like
There are two high-level approaches to recovering gracefully:
Currently, the `mvccGCQueue` calls `CleanupTxnIntentsOnGCAsync`, which ends up calling `cleanupFinishedTxnIntents`. See #97108 for more details on the current status of AC and intent resolution.
Jira issue: CRDB-44794