Review the Mutex usage on PeerManager's peers #533

Open
iwanbk opened this issue Dec 25, 2024 · 2 comments

Comments


iwanbk commented Dec 25, 2024

We currently use Mutex to protect PeerManager's peers HashMap.

peers: Mutex<HashMap<Endpoint, PeerInfo>>,

There are two places that I think could be made read-only:

  1. In fn peers, I'm pretty sure we can make it read-only (a sketch of the read-only variant follows the snippet):

pub fn peers(&self) -> Vec<PeerStats> {
    let peer_map = self.inner.peers.lock().unwrap();
    let mut pi = Vec::with_capacity(peer_map.len());
    for (endpoint, peer_info) in peer_map.iter() {
        let connection_state = if peer_info.connecting {
            ConnectionState::Connecting
        } else if peer_info.pr.alive() {
            ConnectionState::Alive
        } else {
            ConnectionState::Dead
        };
        pi.push(PeerStats {
            endpoint: *endpoint,
            pt: peer_info.pt.clone(),
            connection_state,
            tx_bytes: peer_info.written(),
            rx_bytes: peer_info.read(),
        });
    }
    pi
}
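
For reference, a minimal sketch of what the read-only variant could look like, assuming the peers field is switched to a std::sync::RwLock (this is not what the code currently does; the body stays the same since it never mutates the map, only the guard changes):

// Sketch only, assuming: peers: RwLock<HashMap<Endpoint, PeerInfo>>
pub fn peers(&self) -> Vec<PeerStats> {
    // A shared read guard is enough here; concurrent readers no longer
    // block each other the way they do with the exclusive mutex guard.
    let peer_map = self.inner.peers.read().unwrap();
    let mut pi = Vec::with_capacity(peer_map.len());
    for (endpoint, peer_info) in peer_map.iter() {
        let connection_state = if peer_info.connecting {
            ConnectionState::Connecting
        } else if peer_info.pr.alive() {
            ConnectionState::Alive
        } else {
            ConnectionState::Dead
        };
        pi.push(PeerStats {
            endpoint: *endpoint,
            pt: peer_info.pt.clone(),
            connection_state,
            tx_bytes: peer_info.written(),
            rx_bytes: peer_info.read(),
        });
    }
    pi
}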

  2. In the peer_check_interval handler:

It holds the lock while iterating the whole HashMap, twice, which might take quite long for public nodes.

_ = peer_check_interval.tick() => {
    // Remove dead inbound peers
    self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());
    debug!("Looking for dead peers");
    // check if there is an entry for the peer in the router's peer list
    for (endpoint, pi) in self.peers.lock().unwrap().iter_mut() {
        if !pi.connecting && !pi.pr.alive() {
            debug!(endpoint.address=%endpoint.address(), endpoint.proto=%endpoint.proto(), "Found dead peer");
            if pi.pt == PeerType::Inbound {
                debug!(endpoint.address=%endpoint.address(), endpoint.proto=%endpoint.proto(), "Refusing to reconnect to inbound peer");
                continue
            }
            // Mark that we are connecting to the peer.
            pi.connecting = true;
            connection_futures.push(self.clone().connect_peer(*endpoint, pi.con_traffic.clone()));
            self.metrics.peer_manager_connection_attempted();
        }
    }
}

With the current code, we can't simply replace the lock with a read lock; we have to restructure it.

For example, this line

self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());

could be refactored to

let peers = self.peers.read().unwrap();
let mut to_remove = Vec::new();

for (addr, peer) in peers.iter() {
    if peer.pt == PeerType::Inbound && !peer.pr.alive() {
        to_remove.push(*addr);
    }
}
drop(peers);

// Remove the collected peers
let mut peers = self.peers.write().unwrap();
for addr in to_remove {
    peers.remove(&addr);
}

A similar restructuring would apply to the second lock / iteration; a possible sketch is below.
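
Again purely as a sketch, assuming an RwLock and reusing connect_peer, con_traffic, connection_futures, and the metrics call from the loop above, the dead-peer scan could collect candidates under a read lock and take only a short write lock to mark them as connecting:

// Phase 1: find dead, non-inbound peers under a shared read lock.
let candidates: Vec<Endpoint> = {
    let peers = self.peers.read().unwrap();
    peers
        .iter()
        .filter(|(_, pi)| !pi.connecting && !pi.pr.alive() && pi.pt != PeerType::Inbound)
        .map(|(endpoint, _)| *endpoint)
        .collect()
};

// Phase 2: take a short write lock only to mark the candidates as connecting.
let mut peers = self.peers.write().unwrap();
for endpoint in candidates {
    // An entry might have been removed or revived in between, so re-check it.
    if let Some(pi) = peers.get_mut(&endpoint) {
        if !pi.connecting && !pi.pr.alive() {
            pi.connecting = true;
            connection_futures.push(self.clone().connect_peer(endpoint, pi.con_traffic.clone()));
            self.metrics.peer_manager_connection_attempted();
        }
    }
}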

I wondered about the same thing for the router, but it doesn't look significant there.


iwanbk commented Dec 26, 2024

> on the peer_check_interval

I also wonder why we have separate dead peer checks in the peer manager and the router.
Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers:

_ = peer_check_interval.tick() => {

async fn check_for_dead_peers(self) {


LeeSmet commented Dec 26, 2024

The reason for a mutex is basically that this field is accessed only very infrequently outside of the peer manager's peer check loop, and for simplicity. Even for the public nodes right now, this is not an issue. I also don't think changing to an RwLock would solve a performance issue here, if there ever is one, compared to a mutex. The main issue would be contention, and RwLocks also behave poorly under contention.

On another note, we'll increase the delay for scanning for dead peers in the peer manager in the future.
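
Just to illustrate (this is not the current code, and the 300-second period is a made-up value), increasing that delay would mean constructing the interval with a larger period, assuming it is a tokio::time::interval driven in the select loop above:

use std::time::Duration;

// Hypothetical: scan for dead peers every 5 minutes. The concrete value is
// an assumption, not the project's actual setting.
let mut peer_check_interval = tokio::time::interval(Duration::from_secs(300));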

> I also wonder why we have separate dead peer checks in the peer manager and the router.
> Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers

This is a historic thing; however, the router uses a different condition to determine whether a peer is dead. In the router, we check the time since we last received an IHU, whereas the peer_manager is concerned with the liveness of the actual connection. Therefore, to maintain some separation of concerns, they each have their own task.

In theory, the router could always know about dead peers, so there could be a channel for the router to notify the peer_manager, though for now this is sufficient. In that scenario, the peer_manager would also need to add some timeout before reconnecting, as we don't want a hot reconnect loop to a potentially malfunctioning peer.
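
Purely as an illustration of that idea (none of this exists in the code today; DeadPeer, report_dead_peer and handle_dead_peers are made-up names, and E stands in for the Endpoint type), a rough sketch of such a notification channel using tokio's mpsc could look like this:

use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical message the router would send when it declares a peer dead
// (no IHU within its timeout). E stands in for the project's Endpoint type.
struct DeadPeer<E> {
    endpoint: E,
}

// Router side: report a dead peer; the error is ignored if the peer manager
// side has already shut down.
async fn report_dead_peer<E>(tx: &mpsc::Sender<DeadPeer<E>>, endpoint: E) {
    let _ = tx.send(DeadPeer { endpoint }).await;
}

// Peer manager side: back off before reconnecting so a malfunctioning peer
// does not cause a hot reconnect loop. A real implementation would likely
// track a per-peer backoff instead of one shared delay, and re-check liveness
// before calling the existing connect_peer path.
async fn handle_dead_peers<E: std::fmt::Debug>(
    mut rx: mpsc::Receiver<DeadPeer<E>>,
    reconnect_delay: Duration,
) {
    while let Some(dead) = rx.recv().await {
        tokio::time::sleep(reconnect_delay).await;
        println!("would attempt to reconnect to {:?}", dead.endpoint);
    }
}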
