-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the CRE (possibly infinite) loop from the end of state transfer. #2995
base: master
Are you sure you want to change the base?
Conversation
looks risky. We need to test State transfer scenarios |
There are Apollo tests that test that. But scheduling a deterministic test for testing this scenario is not supported at the moment |
@yontyon , if you do not handle all above side cases then my undersanding is that it is not a complete solution. If replica do crash, will it be able to continue somehow, or is it a "game over" for the replica? if the 2nd, are you planning to implement the additional PR ASAP? From your words, some of the persistency tests will fail for sure. I suggest to disable them. replicas are killed there tens to hundreds of times per test, they will fail for sure on CI. |
// manage to complete state transfer, CRE is halted and we need to execute the command based on the state we gained | ||
// during state transfer. In order not to execute the command twice, we do check that this configuration was not | ||
// already executed by reading the local configuration list. | ||
std::ofstream configurations_file; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for safety and better use of new built-in C++17 features, consider using std::filesystem for the path:
https://en.cppreference.com/w/cpp/filesystem
https://en.cppreference.com/w/cpp/filesystem/path
Are current tests good enough, or do we need to add new tests to check new side cases due to current modifications? |
@glevkovich , I'm referring to this piece of code:
AFAIU, if the replica has crashed during In addition, it's rather a general warning: |
To test the specific scenario, we need to somehow be able to exactly to crash a replica exactly where the loop was before and before its CRE was able to fetch the update. AFAIN, we don't have this capability at the moment. So good tests (IMO) are simply to see that the relevant tests do not crash (this PR run for 4 times already and all was good) |
Hi @yontyon, can you please try the following (currently disabled) test with your PR? I'd suggest to run it in a loop, as a confirmation that your changes here address the underlying test instability. Thanks! |
@yontyon |
94bc615
to
212f5cc
Compare
some of the reconfiguration commands may introduce a certain complexity with state transfer: they may break the ability to complete state transfer (and getting the reconfiguration update). A good example of such a command is the scaling command.
The current solution is to have another reconfiguration mechanism within state transfer - based on the client reconfiguration engine.
The rationale behind this approach is that a replica on state transfer behaves identically to a full copy client (read-only replica) and should get the reconfiguration updates with the same async mechanism as every other client.
However, this means that in order to make sure we got the reconfiguration updates via CRE (the async mechanism) is that we have to make a synchronization phase. In the context of CRE, this was done by making sure we get a newer update than the latest one we got in state transfer.
Unfortunately, getting a bft update requires getting replies from 2f + 1 replica (in the regular case), which means that to finish state transfer we must have other 2f + 1 replicas.
After carefully analyzing the problem, it appears we can have the two mechanisms as long as we sync between them.
We have two options:
Notes:
CI