fix(meta): fix receiving scheduled command on blocked database #20241

Merged
merged 3 commits into from
Jan 21, 2025

Conversation


@wenym1 wenym1 commented Jan 21, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Discovered in https://buildkite.com/risingwavelabs/main-cron/builds/4257#_

We get the following panic in the madsim test.

thread '<unnamed>' panicked at src/meta/src/barrier/checkpoint/control.rs:453:17:
should be at running: should not have command when not running

The assertion assumes that a database in recovery never receives a command. We mark the database as Blocked when it enters recovery, so normal commands cannot be added to its command queue. However, DropXxx commands can still be enqueued, and because we did not stop polling the queue of a recovering database, the global barrier manager still receives such a command and panics.

This PR stops polling the command queue of any blocked database. When a database is marked ready again, we notify any change subscriber if the database's queue is not empty.
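To illustrate the intended behavior, here is a minimal, self-contained sketch of a per-database queue that skips blocked databases when polling and signals a subscriber when a database becomes ready with pending commands. This is not the code in this PR: `Scheduler`, `DatabaseQueue`, `QueueStatus`, the `notifications` counter, and all method names are hypothetical stand-ins, and the real scheduler most likely uses an async notifier rather than a counter.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical status of one database's command queue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QueueStatus {
    Ready,
    Blocked,
}

/// Simplified commands: `Drop` stands in for the DropXxx family.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Command {
    Normal(String),
    Drop(String),
}

struct DatabaseQueue {
    status: QueueStatus,
    commands: VecDeque<Command>,
}

#[derive(Default)]
struct Scheduler {
    queues: BTreeMap<u32, DatabaseQueue>,
    /// Counts wake-ups sent to the change subscriber (assumed stand-in for a notifier).
    notifications: usize,
}

impl Scheduler {
    fn add_database(&mut self, db: u32) {
        let queue = DatabaseQueue { status: QueueStatus::Ready, commands: VecDeque::new() };
        self.queues.insert(db, queue);
    }

    fn mark_blocked(&mut self, db: u32) {
        if let Some(queue) = self.queues.get_mut(&db) {
            queue.status = QueueStatus::Blocked;
        }
    }

    /// Marking a database ready wakes the subscriber only if commands are waiting.
    fn mark_ready(&mut self, db: u32) {
        if let Some(queue) = self.queues.get_mut(&db) {
            queue.status = QueueStatus::Ready;
            if !queue.commands.is_empty() {
                self.notifications += 1;
            }
        }
    }

    /// Normal commands are rejected while blocked; drop commands are still accepted.
    fn enqueue(&mut self, db: u32, cmd: Command) -> Result<(), Command> {
        let Some(queue) = self.queues.get_mut(&db) else {
            return Err(cmd);
        };
        let blocked = queue.status == QueueStatus::Blocked;
        if blocked && matches!(cmd, Command::Normal(_)) {
            return Err(cmd);
        }
        queue.commands.push_back(cmd);
        if !blocked {
            // Only a ready queue can be polled, so only then is a wake-up useful.
            self.notifications += 1;
        }
        Ok(())
    }

    /// Returns the next command, never looking into a blocked database's queue.
    fn poll_next(&mut self) -> Option<(u32, Command)> {
        for (db, queue) in self.queues.iter_mut() {
            if queue.status == QueueStatus::Blocked {
                continue;
            }
            if let Some(cmd) = queue.commands.pop_front() {
                return Some((*db, cmd));
            }
        }
        None
    }
}
```

In this model, a `Drop` enqueued while the database is blocked stays in the queue and is invisible to `poll_next` until `mark_ready` is called, so the consumer never sees a command for a database that is not running.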

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@wenym1 wenym1 force-pushed the yiming/reproduce-expect-command-db-running branch from 7da7d44 to 2b1bf97 Compare January 21, 2025 09:56

gru-agent bot commented Jan 21, 2025

This pull request has been modified. If you want me to regenerate unit test for any of the files related, please find the file in "Files Changed" tab and add a comment @gru-agent. (The github "Comment on this file" feature is in the upper right corner of each file in "Files Changed" tab.)

@wenym1 wenym1 changed the title fix: reproduce expect command db running fix(meta): fix receiving scheduled command on blocked database Jan 21, 2025
@wenym1 wenym1 requested a review from shanicky January 21, 2025 10:01

@kwannoel kwannoel left a comment


Does this mean we cannot drop stream jobs until db recovered?


wenym1 commented Jan 21, 2025

Does this mean we cannot drop stream jobs until db recovered?

No. The stream job will already have been removed from the catalog before we enqueue the command. The command can still be enqueued, and it will take effect either through pre_apply_drop_cancel_scheduled during recovery, or through the normal handling logic after the db has recovered and resumed running.
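The two paths described above can be pictured with a toy model. This is only a sketch under assumptions: `apply_queued_drops_during_recovery` is a hypothetical stand-in for the role that pre_apply_drop_cancel_scheduled plays, not its real signature or behavior, and job ids are plain integers here.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Simplified commands keyed by a hypothetical job id.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Command {
    Create(u32),
    Drop(u32),
}

/// Recovery path (assumed model): consume queued drops before rebuilding,
/// so dropped jobs are excluded from recovery. Returns how many drops were applied.
fn apply_queued_drops_during_recovery(
    queue: &mut VecDeque<Command>,
    jobs_to_recover: &mut BTreeSet<u32>,
) -> usize {
    let before = queue.len();
    queue.retain(|cmd| match cmd {
        Command::Drop(job) => {
            jobs_to_recover.remove(job);
            false // the drop has been applied, so remove it from the queue
        }
        Command::Create(_) => true,
    });
    before - queue.len()
}

/// Normal path: once the database is running again, commands are handled in order.
fn handle_after_resume(queue: &mut VecDeque<Command>, running_jobs: &mut BTreeSet<u32>) {
    while let Some(cmd) = queue.pop_front() {
        match cmd {
            Command::Create(job) => {
                running_jobs.insert(job);
            }
            Command::Drop(job) => {
                running_jobs.remove(&job);
            }
        }
    }
}
```

Either way the drop is not lost: it is applied when recovery drains the queue, or it is handled like any other command once the database is running.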

@wenym1 wenym1 enabled auto-merge January 21, 2025 10:17
@wenym1 wenym1 added this pull request to the merge queue Jan 21, 2025
Merged via the queue into main with commit 33bf9ba Jan 21, 2025
40 of 42 checks passed
@wenym1 wenym1 deleted the yiming/reproduce-expect-command-db-running branch January 21, 2025 11:21
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
3 participants