-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
forget previous generation #108
Conversation
c43c2d4
to
cc12036
Compare
while trying to fix the tests, I found 2 issues in a test which worked before, as far as I can tell because of favorable timing.
|
// send a sentinel element to update the max_version. Otherwise the node's vision | ||
// of max_version will be 0, and it may accept writes that are supposed to be | ||
// stale, but it can tell they are. | ||
if !delta_writer.add_kv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch does not seem unit-tested.
I think it is possible for a Node 4 to send Node 3 info about Node 2 with a higher sequence number. Node 3 would then update it as alive for a little bit. Can you confirm that this scenario will just result in Node 3 thinking for a few seconds that Node 2 is alive. |
Can we merge this? If you have identified more faulty corner case, please open an issue to document them. |
fix #94
I figured what was preventing gc.
Both Node1 and Node3 knows about Node2, but none track its state in their failure detector
what I implemented to fix that is that on first learning about a node, it's recorded in the failure detector as dead. if we get a 2nd heartbeat (with higher sequence number), we update to alive. That way a node in state should always be in the failure detector. We add it as dead because otherwise each time Node1 forget, the message from Node3 makes it believe that Node2 is alive, then Node3 forget, Node1 makes it believe Node2 is alive, then Node1 forget.... (this is only an issue if
dead_node_grace_period
can be smaller than the time it takes to detect a node is dead)currently tests don't pass, and I have not been successful at fixing them. I left a comment with what I've figured so far at the place where it fails. Any help welcome!