-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a way to know if DemotePrimary is blocked and send it in the health stream #17289
base: main
Are you sure you want to change the base?
Conversation
…reamer Signed-off-by: Manan Gupta <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17289 +/- ##
==========================================
+ Coverage 67.40% 67.52% +0.11%
==========================================
Files 1574 1581 +7
Lines 253205 253969 +764
==========================================
+ Hits 170676 171492 +816
+ Misses 82529 82477 -52 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
0c20bed
to
806c8e6
Compare
806c8e6
to
0c20bed
Compare
// We waited for over 10 times of remote operation timeout, but DemotePrimary is still not done. | ||
// Collect more information and signal demote primary is indefinitely stalled. | ||
log.Errorf("DemotePrimary seems to be stalled. Collecting more information.") | ||
tm.QueryServiceControl.SetDemotePrimaryStalled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to ever reset this. Is that because once it is stalled the only solution is to restart the tablet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we read the end of demotePrimary
and we have called SetDemotePrimaryStalled
, what is the correct course of action? it seems like we're assuming this will never happen. should we do something like block forever without returning, to ensure that the tablet doesn't accidentally make forward progress or re-enter the set of serving tablets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is an inherent race between the finishCtx
completing (DemotePrimary finishing) and the timeout triggering. For that matter, DemotePrimary
can unblock and finish, after we've marked the tablet as Stalled. If it is successful, even then I don't really see an issue with the tablet rejoining the serving tablets, until it is eventually restarted by the operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even then I don't really see an issue with the tablet rejoining the serving tablets, until it is eventually restarted by the operator.
hm I think that is potentially a problem. because if the operator gets notified that a tablet is stalled, it's going to forcefully throw that tablet away with the assumption that (a) there is another tablet that is the real primary and (b) the stalled primary is not serving any traffic. if the stalled primary is able to rejoin the set of serving tablets, both of those assumptions go out the window, and it is unsafe for the operator to safely throw it away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is true, this would trigger an ERS. Let me see if we can make the tablet not become serving ever again if it is stalled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! I added few more safeties to ensure nothing goes wrong -
- After we set
DemotePrimaryStalled
we immediately trigger a health check update, which would make vtgate mark this tablet not-serving and not send it any requests ever again, because we never clear the field. - For any requests already sent, if
DemotePrimaryStalled
is set, we won't process it on vttablet and instead just return an error.
I think with these safeguards we can be sure htat a vttablet is not going to accept any new writes once we mark it as stalled.
WDYT @maxenglander? Let's also wait for @deepthi to be able to take a look.
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
…es too longs Signed-off-by: Manan Gupta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
Description
This PR adds the feature requested in #17288.
We spawn a new goroutine when we start
DemotePrimary
and when enough time has passed such that DemotePrimary hasn't finished despite context cancellation, we deem it to be blocked. In this case we send an error in the health stream that the users can track and use to restart MySQL and vttablet.Related Issue(s)
DemotePrimary
is blocked #17288Checklist
Deployment Notes