Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: fix panic catching in the stats cache, show histogram #135944

Merged
merged 1 commit into from
Nov 22, 2024

Conversation

rytaft
Copy link
Collaborator

@rytaft rytaft commented Nov 21, 2024

This commit removes a top-level panic catcher from GetTableStats in the stats
cache, and instead adds a couple of lower level panic catchers. The problems
with the top-level catcher are:

  • There are other entry points into the stats cache besides GetTableStats, so
    it missed some cases.
  • Because the stats cache contains shared state, condition variables, mutexes,
    etc, it was possible to get into a scenario where a panic would bypass
    important logic (such as broadcasting when a resource is available), but the
    panic would be caught at the top level and leave the cache in a bad state.

The new panic catchers are below the level of this shared state manipulation,
but also ensure that all paths into the stats cache that might panic are
covered.

This commit also adds panic catching to SHOW HISTOGRAM so that decoding errors
do not crash the process when running SHOW HISTOGRAM.

Fixes #135940

Release note (bug fix): Fixed an issue where corrupted table statistics could cause the cockroach process to crash.

@rytaft rytaft requested a review from michae2 November 21, 2024 21:05
@rytaft rytaft requested a review from a team as a code owner November 21, 2024 21:05
Copy link

blathers-crl bot commented Nov 21, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rytaft rytaft added backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2 backport-24.3.x Flags PRs that need to be backported to 24.3 labels Nov 21, 2024
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I think this is a good place. Another possibility would be to make it even tighter by moving it below the call to sc.db.Executor().QueryIteratorEx to better match ShowTableStats, but I'm not sure that really gets us anything.

It might be nice to also add panic catching to ShowHistogram to save our future selves some pain.

With this in place, I don't think we really need to make addCacheEntryLocked and refreshCacheEntry panic-safe. Might be nice to leave a comment that they're not panic-safe?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/stats/stats_cache.go line 842 at r1 (raw file):

		if err != nil {
			log.Warningf(ctx, "could not decode statistic for table %d: %v", tableID, err)
			continue

If we panic while parsing one statistic, it would be nice to try to fall back to the next statistic. So maybe we should also put panic catching in parseStats itself (or even lower) so that we hit this continue.

This commit removes a top-level panic catcher from GetTableStats in the stats
cache, and instead adds a couple of lower level panic catchers. The problems
with the top-level catcher are:
- There are other entry points into the stats cache besides GetTableStats, so
  it missed some cases.
- Because the stats cache contains shared state, condition variables, mutexes,
  etc, it was possible to get into a scenario where a panic would bypass
  important logic (such as broadcasting when a resource is available), but the
  panic would be caught at the top level and leave the cache in a bad state.
The new panic catchers are below the level of this shared state manipulation,
but also ensure that all paths into the stats cache that might panic are
covered.

This commit also adds panic catching to SHOW HISTOGRAM so that decoding errors
do not crash the process when running SHOW HISTOGRAM.

Fixes cockroachdb#135940

Release note (bug fix): Fixed an issue where corrupted table statistics could
cause the cockroach process to crash.
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good place. Another possibility would be to make it even tighter by moving it below the call to sc.db.Executor().QueryIteratorEx to better match ShowTableStats, but I'm not sure that really gets us anything.

Done. Good call, I think it's better to avoid including the DB call.

It might be nice to also add panic catching to ShowHistogram to save our future selves some pain.

Done.

With this in place, I don't think we really need to make addCacheEntryLocked and refreshCacheEntry panic-safe. Might be nice to leave a comment that they're not panic-safe?

Well, moving the panic catching below QueryIteratorEx means that these functions could now experience catchable panics. But I think the reason we ran into problems with the Broadcast not firing before was because of the top-level panic catcher in GetTableStats. I think that panic catcher was unsafe, so I've removed it (since we're now catching the relevant panics at a lower level). Given this subtlety, I think a comment in those functions would create more confusion than would be helpful...

Take a look at my new commit / PR message and let me know if you agree.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)


pkg/sql/stats/stats_cache.go line 842 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If we panic while parsing one statistic, it would be nice to try to fall back to the next statistic. So maybe we should also put panic catching in parseStats itself (or even lower) so that we hit this continue.

Done.

@rytaft rytaft changed the title sql/stats: add more panic catching to the stats cache sql/stats: fix panic catching in the stats cache, show histogram Nov 22, 2024
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: good points, I agree. this is nice!

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)


-- commits line 15 at r2:
nice commit message

@rytaft rytaft requested a review from michae2 November 22, 2024 21:06
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)


-- commits line 15 at r2:

Previously, michae2 (Michael Erickson) wrote…

nice commit message

Thanks!

@craig craig bot merged commit 646bf4e into cockroachdb:master Nov 22, 2024
22 of 23 checks passed
Copy link

blathers-crl bot commented Nov 22, 2024

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #135940: branch-release-24.3.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link

blathers-crl bot commented Nov 22, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2484e05 to blathers/backport-release-23.2-135944: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x Flags PRs that need to be backported to 24.2 backport-24.3.x Flags PRs that need to be backported to 24.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: panic when decoding table stats can crash the process
3 participants