Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send PeerViewChange with high priority #4755

Merged
merged 9 commits into from
Jul 16, 2024

Conversation

AndreiEres
Copy link
Contributor

@AndreiEres AndreiEres commented Jun 11, 2024

Closes #577

Changed

  • orchestra updated to 0.4.0
  • PeerViewChange sent with high priority and should be processed first in a queue.
  • To count them in tests added tracker to TestSender and TestOverseer. It acts more like a smoke test though.

Testing on Versi

The changes were tested on Versi with two objectives:

  1. Make sure the node functionality does not change.
  2. See how the changes affect performance.

Test setup:

  • 2.5 hours for each case
  • 100 validators
  • 50 parachains
  • validatorsPerCore = 2
  • neededApprovals = 100
  • nDelayTranches = 89
  • relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which satisfies the first objective.

To estimate the change in performance we used ToF charts. The graphs show that there are no spikes in the top as before. This proves that our hypothesis is correct.

Normalized charts with ToF

image
Before

image
After

Conclusion

The prioritization of subsystem messages reduces the ToF of the networking subsystem, which helps faster propagation of gossip messages.

@AndreiEres AndreiEres requested a review from alexggh June 11, 2024 09:08
@eskimor
Copy link
Member

eskimor commented Jun 11, 2024

This definitely needs to be tested under high load.

  1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes, that goes away with this PR.
  2. Only under high load this feature will make a difference. So also any bugs will only be found in a high load scenario.

To get heavy load you can spawn a good amount of parachains and use malus nodes to create a dispute storm.

@AndreiEres AndreiEres added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jun 11, 2024
@alexggh
Copy link
Contributor

alexggh commented Jun 11, 2024

This definitely needs to be tested under high load.

  1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes, that goes away with this PR.

This PR should help in the situations where we have a high backlog of un-processed messages in approval-distribution and it helps with the fact that we will gossip messages to some peers as soon as we received the PeerViewChange instead of when we caught up with the processing. This has the benefits that peers will faster reached the conclusion that enough assignments have been triggered they won't trigger their own assignment.

We can probably simulate with the approval-voting subsystem-benchmark such a scenario and measure how fast our node gossip the assignments it received to it is peers.

@sandreim
Copy link
Contributor

This definitely needs to be tested under high load.

  1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes, that goes away with this PR.

The main benefit of this change is that we keep gossip protocols in better sync with what other peers work on. I expect this to lead to improved gossip propagation times -> lower approval checking lag, less needless work (like @alexggh said above, we won't trigger assignments while a sufficient other assignments sit in our queue).

  1. Only under high load this feature will make a difference. So also any bugs will only be found in a high load scenario.

I think this is mandatory. AFAIR we had some trouble in the past when switching to async channel, maybe it's a good time to try it and maybe get some speedup there as well.

To get heavy load you can spawn a good amount of parachains and use malus nodes to create a dispute storm.

Yeah, like good old times on Versi :)

@@ -322,6 +339,45 @@ pub fn make_subsystem_context<M, S>(
make_buffered_subsystem_context(spawner, 0)
}

/// Message counter over subsystems.
#[derive(Default, Clone)]
pub struct MessageCounter(Arc<Mutex<MessageCounterInner>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of the inner stuff is counters, we can drop the mutex and use atomics.

macro_rules! send_message {
($event:expr, $message:ident) => {
if let Ok(event) = $event.focus() {
let has_high_priority = matches!(event, NetworkBridgeEvent::PeerViewChange(..));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this a bit more, on side effect of this is that we change the order of messages, for example if the system is loaded I can imagine a situation where the bridge sends the PeerConnected on the normal channel and then the PeerViewChange on the priority channel, so the PeerViewChange will arrive before the PeerConnected and it will be discarded.

I don't think it is a frequent problem, but nonetheless something this might cause, so I wonder if we need to make PeerConnected high priority as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it possible to receive PeerViewChange before PeerConnected because of some network problems?

@AndreiEres AndreiEres force-pushed the AndreiEres/priority-peer-view-change branch 3 times, most recently from f7508b9 to f1ebc26 Compare June 28, 2024 14:34
@AndreiEres AndreiEres force-pushed the AndreiEres/priority-peer-view-change branch from f1ebc26 to 33b293f Compare June 28, 2024 14:35
@AndreiEres
Copy link
Contributor Author

Tested on Versi, results posted to the first message

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6678198

@AndreiEres AndreiEres requested review from alexggh and sandreim July 15, 2024 13:00
Copy link
Contributor

@alexggh alexggh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, left you some comments once those get addressed we should be good to go.

polkadot/node/network/bridge/src/rx/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work @AndreiEres , left just a few more nits.

title: Send PeerViewChange with high priority

doc:
- audience: Node Operator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Node developers should be the main target audience. The node operators would like want to hear if and how would this reduces the CPU utilization of the node.

doc:
- audience: Node Operator
description: |
- orchestra updated to 0.4.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be more specific and mention why we upgrading it, in this case it should be the support for susbsytem message delivery prioritization.

@AndreiEres AndreiEres enabled auto-merge July 16, 2024 17:47
@AndreiEres AndreiEres added this pull request to the merge queue Jul 16, 2024
Merged via the queue into master with commit 975e04b Jul 16, 2024
153 of 158 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/priority-peer-view-change branch July 16, 2024 18:56
ordian added a commit that referenced this pull request Jul 17, 2024
* master:
  add elastic scaling MVP guide (#4663)
  Send PeerViewChange with high priority (#4755)
  [ci] Update forklift in CI image (#5032)
  Adjust base value for statement-distribution regression tests (#5028)
  [pallet_contracts] Add support for transient storage in contracts host functions (#4566)
  [1 / 5] Optimize logic for gossiping assignments (#4848)
  Remove `pallet-getter` usage from pallet-session (#4972)
  command-action: added scoped permissions to the github tokens (#5016)
  net/litep2p: Propagate ValuePut events to the network backend (#5018)
  rpc: add back rpc logger (#4952)
  Updated substrate-relay version for tests (#5017)
  Remove most all usage of `sp-std` (#5010)
  Use sp_runtime::traits::BadOrigin (#5011)
jpserrat pushed a commit to jpserrat/polkadot-sdk that referenced this pull request Jul 18, 2024
Closes paritytech#577

### Changed
- `orchestra` updated to 0.4.0
- `PeerViewChange` sent with high priority and should be processed first
in a queue.
- To count them in tests added tracker to TestSender and TestOverseer.
It acts more like a smoke test though.

### Testing on Versi

The changes were tested on Versi with two objectives:
1. Make sure the node functionality does not change.
2. See how the changes affect performance.

Test setup:
- 2.5 hours for each case
- 100 validators
- 50 parachains
- validatorsPerCore = 2
- neededApprovals = 100
- nDelayTranches = 89
- relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which
satisfies the first objective.

To estimate the change in performance we used ToF charts. The graphs
show that there are no spikes in the top as before. This proves that our
hypothesis is correct.

### Normalized charts with ToF

![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5)
[Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1)


![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880)
[After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1)

### Conclusion

The prioritization of subsystem messages reduces the ToF of the
networking subsystem, which helps faster propagation of gossip messages.
ordian added a commit that referenced this pull request Jul 18, 2024
* master: (125 commits)
  add elastic scaling MVP guide (#4663)
  Send PeerViewChange with high priority (#4755)
  [ci] Update forklift in CI image (#5032)
  Adjust base value for statement-distribution regression tests (#5028)
  [pallet_contracts] Add support for transient storage in contracts host functions (#4566)
  [1 / 5] Optimize logic for gossiping assignments (#4848)
  Remove `pallet-getter` usage from pallet-session (#4972)
  command-action: added scoped permissions to the github tokens (#5016)
  net/litep2p: Propagate ValuePut events to the network backend (#5018)
  rpc: add back rpc logger (#4952)
  Updated substrate-relay version for tests (#5017)
  Remove most all usage of `sp-std` (#5010)
  Use sp_runtime::traits::BadOrigin (#5011)
  network/tx: Ban peers with tx that fail to decode (#5002)
  Try State Hook for Bounties (#4563)
  [statement-distribution] Add metrics for distributed statements in V2 (#4554)
  added sync command (#4818)
  Bridges V2 refactoring backport and `pallet_bridge_messages` simplifications (#4935)
  xcm-executor: Improve logging (#4996)
  Remove usage of `sp-std` on templates (#5001)
  ...
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
Closes paritytech#577

### Changed
- `orchestra` updated to 0.4.0
- `PeerViewChange` sent with high priority and should be processed first
in a queue.
- To count them in tests added tracker to TestSender and TestOverseer.
It acts more like a smoke test though.

### Testing on Versi

The changes were tested on Versi with two objectives:
1. Make sure the node functionality does not change.
2. See how the changes affect performance.

Test setup:
- 2.5 hours for each case
- 100 validators
- 50 parachains
- validatorsPerCore = 2
- neededApprovals = 100
- nDelayTranches = 89
- relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which
satisfies the first objective.

To estimate the change in performance we used ToF charts. The graphs
show that there are no spikes in the top as before. This proves that our
hypothesis is correct.

### Normalized charts with ToF

![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5)
[Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1)


![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880)
[After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1)

### Conclusion

The prioritization of subsystem messages reduces the ToF of the
networking subsystem, which helps faster propagation of gossip messages.
ordian added a commit that referenced this pull request Aug 6, 2024
* master: (130 commits)
  add elastic scaling MVP guide (#4663)
  Send PeerViewChange with high priority (#4755)
  [ci] Update forklift in CI image (#5032)
  Adjust base value for statement-distribution regression tests (#5028)
  [pallet_contracts] Add support for transient storage in contracts host functions (#4566)
  [1 / 5] Optimize logic for gossiping assignments (#4848)
  Remove `pallet-getter` usage from pallet-session (#4972)
  command-action: added scoped permissions to the github tokens (#5016)
  net/litep2p: Propagate ValuePut events to the network backend (#5018)
  rpc: add back rpc logger (#4952)
  Updated substrate-relay version for tests (#5017)
  Remove most all usage of `sp-std` (#5010)
  Use sp_runtime::traits::BadOrigin (#5011)
  network/tx: Ban peers with tx that fail to decode (#5002)
  Try State Hook for Bounties (#4563)
  [statement-distribution] Add metrics for distributed statements in V2 (#4554)
  added sync command (#4818)
  Bridges V2 refactoring backport and `pallet_bridge_messages` simplifications (#4935)
  xcm-executor: Improve logging (#4996)
  Remove usage of `sp-std` on templates (#5001)
  ...
sfffaaa pushed a commit to peaqnetwork/polkadot-sdk that referenced this pull request Dec 27, 2024
Closes paritytech#577

### Changed
- `orchestra` updated to 0.4.0
- `PeerViewChange` sent with high priority and should be processed first
in a queue.
- To count them in tests added tracker to TestSender and TestOverseer.
It acts more like a smoke test though.

### Testing on Versi

The changes were tested on Versi with two objectives:
1. Make sure the node functionality does not change.
2. See how the changes affect performance.

Test setup:
- 2.5 hours for each case
- 100 validators
- 50 parachains
- validatorsPerCore = 2
- neededApprovals = 100
- nDelayTranches = 89
- relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which
satisfies the first objective.

To estimate the change in performance we used ToF charts. The graphs
show that there are no spikes in the top as before. This proves that our
hypothesis is correct.

### Normalized charts with ToF

![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5)
[Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1)


![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880)
[After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1)

### Conclusion

The prioritization of subsystem messages reduces the ToF of the
networking subsystem, which helps faster propagation of gossip messages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process PeerViewChange messages with priority
5 participants