Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the CRE (possibly infinite) loop from the end of state transfer. #2995

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yontyon
Copy link
Contributor

@yontyon yontyon commented Apr 13, 2023

  • Problem Overview
    some of the reconfiguration commands may introduce a certain complexity with state transfer: they may break the ability to complete state transfer (and getting the reconfiguration update). A good example of such a command is the scaling command.
    The current solution is to have another reconfiguration mechanism within state transfer - based on the client reconfiguration engine.
    The rationale behind this approach is that a replica on state transfer behaves identically to a full copy client (read-only replica) and should get the reconfiguration updates with the same async mechanism as every other client.
    However, this means that in order to make sure we got the reconfiguration updates via CRE (the async mechanism) is that we have to make a synchronization phase. In the context of CRE, this was done by making sure we get a newer update than the latest one we got in state transfer.
    Unfortunately, getting a bft update requires getting replies from 2f + 1 replica (in the regular case), which means that to finish state transfer we must have other 2f + 1 replicas.

After carefully analyzing the problem, it appears we can have the two mechanisms as long as we sync between them.

We have two options:

  1. the replica was able to get the reconfiguration update prior to the end of state transfer. In this case, the replica will get the new configuration and start the state transfer all over again (because state transfer has not been finished). Then, once we actually finish state transfer in the new configuration, we won't execute the scale command again as it was written to the configurations file.
  2. The replica was not able to get the update with CRE and was able to complete state transfer. Then we simply execute the reconfiguration update as any other reconfiguration command. As before, we update the configurations file with the new configuration.

Notes:

  1. Removing the synchronization phase (the loop), means that every new reconfiguration command should be developed w.r.t to the above problem (there is no generic solution)
  2. The proposed change does not handle some (possibly) atomic issues (such as crashing in the middle of state transfer callbacks). These problems should be analyzed and handled in a different PR.
  • Testing Done
    CI

@yontyon yontyon requested a review from a team as a code owner April 13, 2023 09:08
@yontyon yontyon requested a review from WildFireFlum April 13, 2023 09:09
@eyalrund
Copy link
Contributor

looks risky. We need to test State transfer scenarios

@yontyon
Copy link
Contributor Author

yontyon commented Apr 13, 2023

looks risky. We need to test State transfer scenarios

There are Apollo tests that test that. But scheduling a deterministic test for testing this scenario is not supported at the moment

@glevkovich
Copy link
Contributor

glevkovich commented Apr 20, 2023

The proposed change does not handle some (possibly) atomic issues (such as crashing in the middle of state transfer callbacks). These problems should be analyzed and handled in a different PR.

@yontyon , if you do not handle all above side cases then my undersanding is that it is not a complete solution. If replica do crash, will it be able to continue somehow, or is it a "game over" for the replica? if the 2nd, are you planning to implement the additional PR ASAP?

From your words, some of the persistency tests will fail for sure. I suggest to disable them. replicas are killed there tens to hundreds of times per test, they will fail for sure on CI.

// manage to complete state transfer, CRE is halted and we need to execute the command based on the state we gained
// during state transfer. In order not to execute the command twice, we do check that this configuration was not
// already executed by reading the local configuration list.
std::ofstream configurations_file;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glevkovich
Copy link
Contributor

@yontyon

Are current tests good enough, or do we need to add new tests to check new side cases due to current modifications?

@yontyon
Copy link
Contributor Author

yontyon commented Apr 24, 2023

The proposed change does not handle some (possibly) atomic issues (such as crashing in the middle of state transfer callbacks). These problems should be analyzed and handled in a different PR.

@yontyon , if you do not handle all above side cases then my undersanding is that it is not a complete solution. If replica do crash, will it be able to continue somehow, or is it a "game over" for the replica? if the 2nd, are you planning to implement the additional PR ASAP?

From your words, some of the persistency tests will fail for sure. I suggest to disable them. replicas are killed there tens to hundreds of times per test, they will fail for sure on CI.

@glevkovich , I'm referring to this piece of code:

  // Mark ST final completion to "outside world" - After this call any incoming messages are not dropped.
  setExternalCollectingState(false);

  // Now after external state have changed, invoke the 2nd group of registered callbacks
  // TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
  // worst case we will have some few incoming messages dropped, until key-exchange is done.
  // on_fetching_state_change_cb_registry_ should be removed
  auto size = on_fetching_state_change_cb_registry_.size();
  LOG_INFO(
      logger_,
      "Starting to invoke all registered calls (on_fetching_state_change_cb_registry_):" << KVLOG(size, checkpointNum));
  on_fetching_state_change_cb_registry_.invokeAll(checkpointNum);
  LOG_INFO(logger_, "Done invoking all registered calls (on_fetching_state_change_cb_registry_)");

AFAIU, if the replica has crashed during on_fetching_state_change_cb_registry_ then it won't execute again the callbacks (because state transfer has already finished).

In addition, it's rather a general warning:
Once you rely on two different mechanisms and one of them is implemented on the state transfer callback, the developer has to have in mind that its logic might be executed twice (in case the replica has crashed during the callback execution). Until now there was a general solution (the CRE loop), but now it is the developer's responsibility to handle all these edge cases.

@yontyon
Copy link
Contributor Author

yontyon commented Apr 24, 2023

@yontyon

Are current tests good enough, or do we need to add new tests to check new side cases due to current modifications?

To test the specific scenario, we need to somehow be able to exactly to crash a replica exactly where the loop was before and before its CRE was able to fetch the update. AFAIN, we don't have this capability at the moment. So good tests (IMO) are simply to see that the relevant tests do not crash (this PR run for 4 times already and all was good)

@teoparvanov
Copy link
Contributor

teoparvanov commented Apr 24, 2023

Hi @yontyon, can you please try the following (currently disabled) test with your PR?
https://github.com/vmware/concord-bft/blob/master/tests/apollo/test_skvbc_restart_recovery.py#L495

I'd suggest to run it in a loop, as a confirmation that your changes here address the underlying test instability.

Thanks!

@glevkovich
Copy link
Contributor

The proposed change does not handle some (possibly) atomic issues (such as crashing in the middle of state transfer callbacks). These problems should be analyzed and handled in a different PR.
@yontyon , if you do not handle all above side cases then my undersanding is that it is not a complete solution. If replica do crash, will it be able to continue somehow, or is it a "game over" for the replica? if the 2nd, are you planning to implement the additional PR ASAP?
From your words, some of the persistency tests will fail for sure. I suggest to disable them. replicas are killed there tens to hundreds of times per test, they will fail for sure on CI.

@glevkovich , I'm referring to this piece of code:

  // Mark ST final completion to "outside world" - After this call any incoming messages are not dropped.
  setExternalCollectingState(false);

  // Now after external state have changed, invoke the 2nd group of registered callbacks
  // TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.
  // worst case we will have some few incoming messages dropped, until key-exchange is done.
  // on_fetching_state_change_cb_registry_ should be removed
  auto size = on_fetching_state_change_cb_registry_.size();
  LOG_INFO(
      logger_,
      "Starting to invoke all registered calls (on_fetching_state_change_cb_registry_):" << KVLOG(size, checkpointNum));
  on_fetching_state_change_cb_registry_.invokeAll(checkpointNum);
  LOG_INFO(logger_, "Done invoking all registered calls (on_fetching_state_change_cb_registry_)");

AFAIU, if the replica has crashed during on_fetching_state_change_cb_registry_ then it won't execute again the callbacks (because state transfer has already finished).

In addition, it's rather a general warning: Once you rely on two different mechanisms and one of them is implemented on the state transfer callback, the developer has to have in mind that its logic might be executed twice (in case the replica has crashed during the callback execution). Until now there was a general solution (the CRE loop), but now it is the developer's responsibility to handle all these edge cases.

@yontyon
Yes, ST is done at this stage. I wonder if we still need this callback here? After all, it is not part of ST (ST is done alrrady).
In current master code i've added a TODO:
// TODO - This next line should be converted to an internal message. There is no need for a callback here. ST is done.

@yontyon yontyon force-pushed the remove-st-cre-loop branch from 94bc615 to 212f5cc Compare May 1, 2023 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants