-
Notifications
You must be signed in to change notification settings - Fork 800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send PeerViewChange with high priority #4755
Conversation
This definitely needs to be tested under high load.
To get heavy load you can spawn a good amount of parachains and use malus nodes to create a dispute storm. |
This PR should help in the situations where we have a high backlog of un-processed messages in approval-distribution and it helps with the fact that we will gossip messages to some peers as soon as we received the PeerViewChange instead of when we caught up with the processing. This has the benefits that peers will faster reached the conclusion that enough assignments have been triggered they won't trigger their own assignment. We can probably simulate with the approval-voting subsystem-benchmark such a scenario and measure how fast our node gossip the assignments it received to it is peers. |
The main benefit of this change is that we keep gossip protocols in better sync with what other peers work on. I expect this to lead to improved gossip propagation times -> lower approval checking lag, less needless work (like @alexggh said above, we won't trigger assignments while a sufficient other assignments sit in our queue).
I think this is mandatory. AFAIR we had some trouble in the past when switching to async channel, maybe it's a good time to try it and maybe get some speedup there as well.
Yeah, like good old times on Versi :) |
@@ -322,6 +339,45 @@ pub fn make_subsystem_context<M, S>( | |||
make_buffered_subsystem_context(spawner, 0) | |||
} | |||
|
|||
/// Message counter over subsystems. | |||
#[derive(Default, Clone)] | |||
pub struct MessageCounter(Arc<Mutex<MessageCounterInner>>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of the inner stuff is counters, we can drop the mutex and use atomics.
macro_rules! send_message { | ||
($event:expr, $message:ident) => { | ||
if let Ok(event) = $event.focus() { | ||
let has_high_priority = matches!(event, NetworkBridgeEvent::PeerViewChange(..)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about this a bit more, on side effect of this is that we change the order of messages, for example if the system is loaded I can imagine a situation where the bridge sends the PeerConnected
on the normal channel and then the PeerViewChange on the priority channel, so the PeerViewChange will arrive before the PeerConnected and it will be discarded.
I don't think it is a frequent problem, but nonetheless something this might cause, so I wonder if we need to make PeerConnected
high priority as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it possible to receive PeerViewChange before PeerConnected because of some network problems?
f7508b9
to
f1ebc26
Compare
f1ebc26
to
33b293f
Compare
Tested on Versi, results posted to the first message |
The CI pipeline was cancelled due to failure one of the required jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me, left you some comments once those get addressed we should be good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work @AndreiEres , left just a few more nits.
prdoc/pr_4755.prdoc
Outdated
title: Send PeerViewChange with high priority | ||
|
||
doc: | ||
- audience: Node Operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Node developers should be the main target audience. The node operators would like want to hear if and how would this reduces the CPU utilization of the node.
prdoc/pr_4755.prdoc
Outdated
doc: | ||
- audience: Node Operator | ||
description: | | ||
- orchestra updated to 0.4.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be more specific and mention why we upgrading it, in this case it should be the support for susbsytem message delivery prioritization.
* master: add elastic scaling MVP guide (#4663) Send PeerViewChange with high priority (#4755) [ci] Update forklift in CI image (#5032) Adjust base value for statement-distribution regression tests (#5028) [pallet_contracts] Add support for transient storage in contracts host functions (#4566) [1 / 5] Optimize logic for gossiping assignments (#4848) Remove `pallet-getter` usage from pallet-session (#4972) command-action: added scoped permissions to the github tokens (#5016) net/litep2p: Propagate ValuePut events to the network backend (#5018) rpc: add back rpc logger (#4952) Updated substrate-relay version for tests (#5017) Remove most all usage of `sp-std` (#5010) Use sp_runtime::traits::BadOrigin (#5011)
Closes paritytech#577 ### Changed - `orchestra` updated to 0.4.0 - `PeerViewChange` sent with high priority and should be processed first in a queue. - To count them in tests added tracker to TestSender and TestOverseer. It acts more like a smoke test though. ### Testing on Versi The changes were tested on Versi with two objectives: 1. Make sure the node functionality does not change. 2. See how the changes affect performance. Test setup: - 2.5 hours for each case - 100 validators - 50 parachains - validatorsPerCore = 2 - neededApprovals = 100 - nDelayTranches = 89 - relayVrfModuloSamples = 50 During the test period, all nodes ran without any crashes, which satisfies the first objective. To estimate the change in performance we used ToF charts. The graphs show that there are no spikes in the top as before. This proves that our hypothesis is correct. ### Normalized charts with ToF ![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5) [Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1) ![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880) [After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1) ### Conclusion The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.
* master: (125 commits) add elastic scaling MVP guide (#4663) Send PeerViewChange with high priority (#4755) [ci] Update forklift in CI image (#5032) Adjust base value for statement-distribution regression tests (#5028) [pallet_contracts] Add support for transient storage in contracts host functions (#4566) [1 / 5] Optimize logic for gossiping assignments (#4848) Remove `pallet-getter` usage from pallet-session (#4972) command-action: added scoped permissions to the github tokens (#5016) net/litep2p: Propagate ValuePut events to the network backend (#5018) rpc: add back rpc logger (#4952) Updated substrate-relay version for tests (#5017) Remove most all usage of `sp-std` (#5010) Use sp_runtime::traits::BadOrigin (#5011) network/tx: Ban peers with tx that fail to decode (#5002) Try State Hook for Bounties (#4563) [statement-distribution] Add metrics for distributed statements in V2 (#4554) added sync command (#4818) Bridges V2 refactoring backport and `pallet_bridge_messages` simplifications (#4935) xcm-executor: Improve logging (#4996) Remove usage of `sp-std` on templates (#5001) ...
Closes paritytech#577 ### Changed - `orchestra` updated to 0.4.0 - `PeerViewChange` sent with high priority and should be processed first in a queue. - To count them in tests added tracker to TestSender and TestOverseer. It acts more like a smoke test though. ### Testing on Versi The changes were tested on Versi with two objectives: 1. Make sure the node functionality does not change. 2. See how the changes affect performance. Test setup: - 2.5 hours for each case - 100 validators - 50 parachains - validatorsPerCore = 2 - neededApprovals = 100 - nDelayTranches = 89 - relayVrfModuloSamples = 50 During the test period, all nodes ran without any crashes, which satisfies the first objective. To estimate the change in performance we used ToF charts. The graphs show that there are no spikes in the top as before. This proves that our hypothesis is correct. ### Normalized charts with ToF ![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5) [Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1) ![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880) [After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1) ### Conclusion The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.
* master: (130 commits) add elastic scaling MVP guide (#4663) Send PeerViewChange with high priority (#4755) [ci] Update forklift in CI image (#5032) Adjust base value for statement-distribution regression tests (#5028) [pallet_contracts] Add support for transient storage in contracts host functions (#4566) [1 / 5] Optimize logic for gossiping assignments (#4848) Remove `pallet-getter` usage from pallet-session (#4972) command-action: added scoped permissions to the github tokens (#5016) net/litep2p: Propagate ValuePut events to the network backend (#5018) rpc: add back rpc logger (#4952) Updated substrate-relay version for tests (#5017) Remove most all usage of `sp-std` (#5010) Use sp_runtime::traits::BadOrigin (#5011) network/tx: Ban peers with tx that fail to decode (#5002) Try State Hook for Bounties (#4563) [statement-distribution] Add metrics for distributed statements in V2 (#4554) added sync command (#4818) Bridges V2 refactoring backport and `pallet_bridge_messages` simplifications (#4935) xcm-executor: Improve logging (#4996) Remove usage of `sp-std` on templates (#5001) ...
Closes paritytech#577 ### Changed - `orchestra` updated to 0.4.0 - `PeerViewChange` sent with high priority and should be processed first in a queue. - To count them in tests added tracker to TestSender and TestOverseer. It acts more like a smoke test though. ### Testing on Versi The changes were tested on Versi with two objectives: 1. Make sure the node functionality does not change. 2. See how the changes affect performance. Test setup: - 2.5 hours for each case - 100 validators - 50 parachains - validatorsPerCore = 2 - neededApprovals = 100 - nDelayTranches = 89 - relayVrfModuloSamples = 50 During the test period, all nodes ran without any crashes, which satisfies the first objective. To estimate the change in performance we used ToF charts. The graphs show that there are no spikes in the top as before. This proves that our hypothesis is correct. ### Normalized charts with ToF ![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5) [Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1) ![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880) [After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1) ### Conclusion The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.
Closes #577
Changed
orchestra
updated to 0.4.0PeerViewChange
sent with high priority and should be processed first in a queue.Testing on Versi
The changes were tested on Versi with two objectives:
Test setup:
During the test period, all nodes ran without any crashes, which satisfies the first objective.
To estimate the change in performance we used ToF charts. The graphs show that there are no spikes in the top as before. This proves that our hypothesis is correct.
Normalized charts with ToF
Before
After
Conclusion
The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.