Add a way to know if DemotePrimary is blocked and send it in the health stream #17289
We don't seem to ever reset this. Is that because once it is stalled the only solution is to restart the tablet?
Yes, exactly!
If we reach the end of `demotePrimary` and we have called `SetDemotePrimaryStalled`, what is the correct course of action? It seems like we're assuming this will never happen. Should we do something like block forever without returning, to ensure that the tablet doesn't accidentally make forward progress or re-enter the set of serving tablets?
Yes, there is an inherent race between the `finishCtx` completing (DemotePrimary finishing) and the timeout triggering. For that matter, `DemotePrimary` can unblock and finish after we've marked the tablet as stalled. Even if it is successful, I don't really see an issue with the tablet rejoining the serving tablets until it is eventually restarted by the operator.
Hm, I think that is potentially a problem, because if the operator gets notified that a tablet is stalled, it's going to forcefully throw that tablet away on the assumption that (a) there is another tablet that is the real primary and (b) the stalled primary is not serving any traffic. If the stalled primary is able to rejoin the set of serving tablets, both of those assumptions go out the window, and it is no longer safe for the operator to throw it away.
Yes, that is true, this would trigger an ERS. Let me see if we can make the tablet not become serving ever again if it is stalled.
Okay! I added a few more safeties to ensure nothing goes wrong:
- When we set `DemotePrimaryStalled`, we immediately trigger a health check update, which makes vtgate mark this tablet not-serving and not send it any requests ever again, because we never clear the field.
- If a request arrives while `DemotePrimaryStalled` is set, we won't process it on vttablet and instead just return an error.

I think with these safeguards we can be sure that a vttablet is not going to accept any new writes once we mark it as stalled.
WDYT @maxenglander? Let's also wait for @deepthi to be able to take a look.