-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/stats: fix panic catching in the stats cache, show histogram #135944
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good place. Another possibility would be to make it even tighter by moving it below the call to sc.db.Executor().QueryIteratorEx
to better match ShowTableStats
, but I'm not sure that really gets us anything.
It might be nice to also add panic catching to ShowHistogram
to save our future selves some pain.
With this in place, I don't think we really need to make addCacheEntryLocked
and refreshCacheEntry
panic-safe. Might be nice to leave a comment that they're not panic-safe?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/stats/stats_cache.go
line 842 at r1 (raw file):
if err != nil { log.Warningf(ctx, "could not decode statistic for table %d: %v", tableID, err) continue
If we panic while parsing one statistic, it would be nice to try to fall back to the next statistic. So maybe we should also put panic catching in parseStats
itself (or even lower) so that we hit this continue
.
This commit removes a top-level panic catcher from GetTableStats in the stats cache, and instead adds a couple of lower level panic catchers. The problems with the top-level catcher are: - There are other entry points into the stats cache besides GetTableStats, so it missed some cases. - Because the stats cache contains shared state, condition variables, mutexes, etc, it was possible to get into a scenario where a panic would bypass important logic (such as broadcasting when a resource is available), but the panic would be caught at the top level and leave the cache in a bad state. The new panic catchers are below the level of this shared state manipulation, but also ensure that all paths into the stats cache that might panic are covered. This commit also adds panic catching to SHOW HISTOGRAM so that decoding errors do not crash the process when running SHOW HISTOGRAM. Fixes cockroachdb#135940 Release note (bug fix): Fixed an issue where corrupted table statistics could cause the cockroach process to crash.
995d2d1
to
2484e05
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good place. Another possibility would be to make it even tighter by moving it below the call to
sc.db.Executor().QueryIteratorEx
to better matchShowTableStats
, but I'm not sure that really gets us anything.
Done. Good call, I think it's better to avoid including the DB call.
It might be nice to also add panic catching to
ShowHistogram
to save our future selves some pain.
Done.
With this in place, I don't think we really need to make
addCacheEntryLocked
andrefreshCacheEntry
panic-safe. Might be nice to leave a comment that they're not panic-safe?
Well, moving the panic catching below QueryIteratorEx
means that these functions could now experience catchable panics. But I think the reason we ran into problems with the Broadcast
not firing before was because of the top-level panic catcher in GetTableStats
. I think that panic catcher was unsafe, so I've removed it (since we're now catching the relevant panics at a lower level). Given this subtlety, I think a comment in those functions would create more confusion than would be helpful...
Take a look at my new commit / PR message and let me know if you agree.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)
pkg/sql/stats/stats_cache.go
line 842 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
If we panic while parsing one statistic, it would be nice to try to fall back to the next statistic. So maybe we should also put panic catching in
parseStats
itself (or even lower) so that we hit thiscontinue
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good points, I agree. this is nice!
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
-- commits
line 15 at r2:
nice commit message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2)
Previously, michae2 (Michael Erickson) wrote…
nice commit message
Thanks!
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #135940: branch-release-24.3. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 2484e05 to blathers/backport-release-23.2-135944: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit removes a top-level panic catcher from
GetTableStats
in the statscache, and instead adds a couple of lower level panic catchers. The problems
with the top-level catcher are:
GetTableStats
, soit missed some cases.
etc, it was possible to get into a scenario where a panic would bypass
important logic (such as broadcasting when a resource is available), but the
panic would be caught at the top level and leave the cache in a bad state.
The new panic catchers are below the level of this shared state manipulation,
but also ensure that all paths into the stats cache that might panic are
covered.
This commit also adds panic catching to
SHOW HISTOGRAM
so that decoding errorsdo not crash the process when running
SHOW HISTOGRAM
.Fixes #135940
Release note (bug fix): Fixed an issue where corrupted table statistics could cause the cockroach process to crash.