-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add chaos-mesh
for testing (BFT-392)
#70
Conversation
Co-authored-by: Bruno França <[email protected]>
@moshababo, I've implemented the initial approach for this integration here. It's likely to undergo several iterations and discussions to refine the testing flow and the introduction of chaos into the network. I may make some additional changes in the coming days, but please feel free to suggest any improvements or request changes as needed. |
Can you respond to my previous reviews so I'll know what changes to expect? |
@moshababo Done! |
node/tests/src/main.rs
Outdated
.unwrap(); | ||
let last_voted_view: u64 = | ||
serde_json::from_value(response.get("last_commited_block").unwrap().to_owned()).unwrap(); | ||
for _ in 0..5 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first my idea was to check a few times that the node was indeed going through the delay and not being able to vote for the views, now I'm not sure if that is useful at all. Maybe there's a better way to achieve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it adds much value, while adding non-determinism to the scenario (for being dependent on the response latency).
It is good to check that status while the chaos resources are deployed, but it's better done via test controlling the teardown explicitly.
node/tests/src/main.rs
Outdated
pub async fn delay_test(test_result: Arc<Mutex<u8>>) -> anyhow::Result<()> { | ||
let client = k8s::get_client().await.unwrap(); | ||
let target_node = "consensus-node-01"; | ||
let ip = k8s::get_node_rpc_address_with_name(&client, target_node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to further encapsulate technical details under the test framework. The actual test scenario should more declarative.
node/tests/src/main.rs
Outdated
/// We use unwraps here because this function is intended to be used like a test. | ||
pub async fn delay_test(test_result: Arc<Mutex<u8>>) -> anyhow::Result<()> { | ||
let client = k8s::get_client().await.unwrap(); | ||
let target_node = "consensus-node-01"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test framework should support the installation of chaos resources for a set of nodes, not just a single one.
node/tests/src/main.rs
Outdated
@@ -83,6 +83,15 @@ pub async fn sanity_test() { | |||
} | |||
} | |||
|
|||
/// Sanity test for the RPC server. | |||
/// We use unwraps here because this function is intended to be used like a test. | |||
pub async fn delay_test() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, most byzantine behavior duration will need to be "denominated" in views, not clock time, for actually testing liveness during the disruption (and not just quick recovery once it's over).
It might be needed only for the removal, while applying it always at the beginning of the test, but it's better to just have the test framework supports chaos scheduling with from_view
,to_view
args.
@moshababo given the developments with AttackNet, should we close this? |
What ❔
Incorporate
chaos-mesh
tools into the consensus testing framework.Why ❔
To validate the network's behavior in response to non-canonical node behavior, we need tests.
Chaos-mesh
enables us to introduce delays or node crashes, simulating real-world network scenarios for thorough testing.