Review the Mutex usage on PeerManager's peers #533

Open
iwanbk opened this issue Dec 25, 2024 · 2 comments

Comments


iwanbk commented Dec 25, 2024

We currently use Mutex to protect PeerManager's peers HashMap.

peers: Mutex<HashMap<Endpoint, PeerInfo>>,

There are two places that I think could be made read-only:

  1. In fn peers, I'm pretty sure we can make it read-only (a sketch of the read-only variant follows the snippet):

pub fn peers(&self) -> Vec<PeerStats> {
    let peer_map = self.inner.peers.lock().unwrap();
    let mut pi = Vec::with_capacity(peer_map.len());
    for (endpoint, peer_info) in peer_map.iter() {
        let connection_state = if peer_info.connecting {
            ConnectionState::Connecting
        } else if peer_info.pr.alive() {
            ConnectionState::Alive
        } else {
            ConnectionState::Dead
        };
        pi.push(PeerStats {
            endpoint: *endpoint,
            pt: peer_info.pt.clone(),
            connection_state,
            tx_bytes: peer_info.written(),
            rx_bytes: peer_info.read(),
        });
    }
    pi
}
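
For reference, a minimal sketch of what the read-only variant could look like, assuming the peers field is switched to a std::sync::RwLock (this is not what the code currently does; the body stays the same since it never mutates the map, only the guard changes):

// Sketch only, assuming: peers: RwLock<HashMap<Endpoint, PeerInfo>>
pub fn peers(&self) -> Vec<PeerStats> {
    // A shared read guard is enough here; concurrent readers no longer
    // block each other the way they do with the exclusive mutex guard.
    let peer_map = self.inner.peers.read().unwrap();
    let mut pi = Vec::with_capacity(peer_map.len());
    for (endpoint, peer_info) in peer_map.iter() {
        let connection_state = if peer_info.connecting {
            ConnectionState::Connecting
        } else if peer_info.pr.alive() {
            ConnectionState::Alive
        } else {
            ConnectionState::Dead
        };
        pi.push(PeerStats {
            endpoint: *endpoint,
            pt: peer_info.pt.clone(),
            connection_state,
            tx_bytes: peer_info.written(),
            rx_bytes: peer_info.read(),
        });
    }
    pi
}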

  2. In the peer_check_interval handler:

It holds the lock while iterating the whole HashMap, twice, which might take quite long for public nodes.

_ = peer_check_interval.tick() => {
    // Remove dead inbound peers
    self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());
    debug!("Looking for dead peers");
    // check if there is an entry for the peer in the router's peer list
    for (endpoint, pi) in self.peers.lock().unwrap().iter_mut() {
        if !pi.connecting && !pi.pr.alive() {
            debug!(endpoint.address=%endpoint.address(), endpoint.proto=%endpoint.proto(), "Found dead peer");
            if pi.pt == PeerType::Inbound {
                debug!(endpoint.address=%endpoint.address(), endpoint.proto=%endpoint.proto(), "Refusing to reconnect to inbound peer");
                continue
            }
            // Mark that we are connecting to the peer.
            pi.connecting = true;
            connection_futures.push(self.clone().connect_peer(*endpoint, pi.con_traffic.clone()));
            self.metrics.peer_manager_connection_attempted();
        }
    }
}

With the current code, we can't simply replace the lock with a read lock; we have to restructure it.

For example, this line

self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());

could be refactored to

let peers = self.peers.read().unwrap();
let mut to_remove = Vec::new();

for (addr, peer) in peers.iter() {
    if peer.pt == PeerType::Inbound && !peer.pr.alive() {
        to_remove.push(*addr);
    }
}
drop(peers);

// Remove the collected peers
let mut peers = self.peers.write().unwrap();
for addr in to_remove {
    peers.remove(&addr);
}

A similar restructuring would apply to the second lock / iteration; a possible sketch is below.
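
Again purely as a sketch, assuming an RwLock and reusing connect_peer, con_traffic, connection_futures, and the metrics call from the loop above, the dead-peer scan could collect candidates under a read lock and take only a short write lock to mark them as connecting:

// Phase 1: find dead, non-inbound peers under a shared read lock.
let candidates: Vec<Endpoint> = {
    let peers = self.peers.read().unwrap();
    peers
        .iter()
        .filter(|(_, pi)| !pi.connecting && !pi.pr.alive() && pi.pt != PeerType::Inbound)
        .map(|(endpoint, _)| *endpoint)
        .collect()
};

// Phase 2: take a short write lock only to mark the candidates as connecting.
let mut peers = self.peers.write().unwrap();
for endpoint in candidates {
    // An entry might have been removed or revived in between, so re-check it.
    if let Some(pi) = peers.get_mut(&endpoint) {
        if !pi.connecting && !pi.pr.alive() {
            pi.connecting = true;
            connection_futures.push(self.clone().connect_peer(endpoint, pi.con_traffic.clone()));
            self.metrics.peer_manager_connection_attempted();
        }
    }
}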

I wondered about the same thing for the router, but it doesn't look significant there.


iwanbk commented Dec 26, 2024

> on the peer_check_interval

I also wonder why we have separate dead peer checks in the peer manager and the router.
Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers:

_ = peer_check_interval.tick() => {

async fn check_for_dead_peers(self) {


LeeSmet commented Dec 26, 2024

The reason for a mutex is basically that this field is accessed only very infrequently outside of the peer manager's peer check loop, and for simplicity. Even for the public nodes right now, this is not an issue. I also don't think changing to an RwLock would solve a performance issue here, if there ever is one, compared to a mutex. The main issue would be contention, and RwLocks also behave poorly under contention.

On another note, we'll increase the delay for scanning for dead peers in the peer manager in the future.
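
Just to illustrate (this is not the current code, and the 300-second period is a made-up value), increasing that delay would mean constructing the interval with a larger period, assuming it is a tokio::time::interval driven in the select loop above:

use std::time::Duration;

// Hypothetical: scan for dead peers every 5 minutes. The concrete value is
// an assumption, not the project's actual setting.
let mut peer_check_interval = tokio::time::interval(Duration::from_secs(300));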

> I also wonder why we have separate dead peer checks in the peer manager and the router.
> Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers

This is a historic thing; however, the router uses a different condition to determine whether a peer is dead. In the router, we check the time since we last received an IHU, whereas the peer_manager is concerned with the liveness of the actual connection. Therefore, to maintain some separation of concerns, they each have their own task.

In theory, the router could always know about dead peers, so there could be a channel for the router to notify the peer_manager, though for now this is sufficient. In that scenario, the peer_manager would also need to add some timeout before reconnecting, as we don't want a hot reconnect loop to a potentially malfunctioning peer.
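
Purely as an illustration of that idea (none of this exists in the code today; DeadPeer, report_dead_peer and handle_dead_peers are made-up names, and E stands in for the Endpoint type), a rough sketch of such a notification channel using tokio's mpsc could look like this:

use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical message the router would send when it declares a peer dead
// (no IHU within its timeout). E stands in for the project's Endpoint type.
struct DeadPeer<E> {
    endpoint: E,
}

// Router side: report a dead peer; the error is ignored if the peer manager
// side has already shut down.
async fn report_dead_peer<E>(tx: &mpsc::Sender<DeadPeer<E>>, endpoint: E) {
    let _ = tx.send(DeadPeer { endpoint }).await;
}

// Peer manager side: back off before reconnecting so a malfunctioning peer
// does not cause a hot reconnect loop. A real implementation would likely
// track a per-peer backoff instead of one shared delay, and re-check liveness
// before calling the existing connect_peer path.
async fn handle_dead_peers<E: std::fmt::Debug>(
    mut rx: mpsc::Receiver<DeadPeer<E>>,
    reconnect_delay: Duration,
) {
    while let Some(dead) = rx.recv().await {
        tokio::time::sleep(reconnect_delay).await;
        println!("would attempt to reconnect to {:?}", dead.endpoint);
    }
}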
