Performance data reports #459

Open
scottyeager opened this issue Sep 24, 2024 · 12 comments

I've been collecting some data using SmokePing to get a sense of Mycelium performance from my vantage point at home. A description of the methodology and everything needed to reproduce my approach is in this repo.

In summary, I'm connecting to all public Mycelium nodes via IPv4 TCP and pinging them periodically both over Mycelium and over IPv4. This was meant mostly to be a benchmark to evaluate other Mycelium hosts against, but it's revealed some trends that I think are worth highlighting. For most of the public nodes, I observe large variations in latency over Mycelium versus regular IPv4.

Here are some high level graphs, first showing the IPv4 ping performance to the public nodes. The line represents median ping time, with the "smoke" representing deviations from the median:

image

We can see that latency to these nodes is basically flat, with occasional minor deviations. This sample is representative of the data I've collected so far.

Here's the same view, but for pings sent over Mycelium:

image

Sometimes the behavior over Mycelium seems to be related to an issue also seen on regular IPv4, but sometimes not. Here's an example from the SG node of a rather substantial latency spike on Mycelium:

image

But IPv4 looks rather clean over the same time period:

image

Here's a case where an issue observed on IPv4 seems to be amplified over Mycelium. We see relatively high packet loss in purple and pink:

image

Versus relatively low packet loss over IPv4 at the same time:

image

Here's a longer window, showing the large latency swings and a period of packet loss on Mycelium:

image

Versus IPv4:

image

So what I see overall is that median latency over Mycelium can vary by 100% hour to hour for directly connected public peers, while the medians over IPv4 to the same nodes tend to vary by no more than 5-10%.

It also appears that small amounts of packet loss on the underlay network get amplified into larger packet loss over Mycelium.

However, the latency to my closest public peer, US West, is much more stable. That could of course be a coincidence, which could be cleared up by running the same test from different locations.

It's possible that latency is shifting along with load on the public nodes, and perhaps strain on the Mycelium process could cause these results. I don't have the visibility to say whether that's happening, but it doesn't seem likely given the project's current level of exposure and the lack of clear correlation between nodes during periods of high latency.

@scottyeager (Author)

Here's another look at some of the data, this time with a bit more of a "real world" test. I'm also executing both ping and SSH probes against two VMs running on Zos nodes in the GreenEdge St Gallen farm. Anecdotally there seem to be intermittent issues with SSH connections to Zos VMs over Mycelium, so the intention here was to try to capture that in action.

Here's the SSH probe results via Mycelium to one VM over a ten day period:

image

Here's the ping performance via Mycelium to the same VM for the same period:

image

Next I'll show ping performance over Mycelium to the four public nodes closest to St Gallen:

image
image
image
image

Also pings to those public nodes over IPv4:

image
image
image
image

I'm not sure which nodes exactly are involved in routing to this VM. All I can see in the Mycelium logs is that the next hop for the route is the public node US West. Here's the ping graph for both Mycelium and IPv4 for that node:

image
image

I suppose the traffic could also be traversing the US East node before crossing the pond, so for good measure:

image
image

In summary, there are periods of significant loss of SSH messages over Mycelium, and these appear to correlate with ping packet loss over Mycelium as well.

scottyeager changed the title from "Latency spikes to directly connected peers" to "Performance data reports" on Oct 2, 2024
@LeeSmet (Contributor) commented Oct 14, 2024

There are quite a few things going on here, so I'll start by trying to explain some of the behaviors seen. The current Mycelium transports are TCP and QUIC reliable channels, which behave like TCP (ordered, reliable delivery with acknowledgements and congestion control). This is actually not great for an overlay network, since the overlay assumes it is working on IP, which is by nature lossy. It would be better if we could use UDP, though right now plain UDP is not really feasible since we rely on the reliable semantics for the protocol messages. I have some work on a branch for using QUIC datagrams for the actual data, though that needs more tuning before it is useful.
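For reference, here is a rough sketch of what that datagram path could look like, assuming the quinn crate; this is illustrative only and not the actual branch code:

use bytes::Bytes;
use quinn::Connection;

// Forward an overlay packet as an unreliable QUIC datagram: no retransmission,
// no ordering, no head-of-line blocking, so a lost packet is simply lost, like plain IP.
fn forward_packet(conn: &Connection, packet: Bytes) {
    if let Err(e) = conn.send_datagram(packet) {
        // Fails e.g. if the peer disabled datagram support or the packet exceeds max_datagram_size().
        eprintln!("failed to send datagram: {e}");
    }
}

// Receive raw packets from the peer and hand them to the router / TUN writer.
async fn receive_packets(conn: &Connection) {
    while let Ok(packet) = conn.read_datagram().await {
        println!("received {} byte packet", packet.len());
    }
}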

Packet loss in the underlay, when using these transports, generally translates to higher latencies due to retransmission, though it is interesting to see that packet loss is worse in the overlay.

Latency spikes are generally somewhat expected, since Mycelium is a userspace process, whereas general network handling (especially for ping) is done in the kernel. We are also currently doing continuous network throughput testing; in general this causes spikes in packets handled every 5 minutes or so, which translates to somewhat higher latency during these periods.

For SSH sessions, part of this is likely due to TCP meltdown (the negative effect of packet loss with TCP-in-TCP, where both the overlay and underlay sessions independently try to recover the lost packets, ending up fighting each other).

In general, some more testing will have to be done under non-optimal network conditions (high latency with some minor packet loss) to see how the network behaves, but this is an interesting starting point.

iwanbk self-assigned this on Nov 20, 2024
@scottyeager (Author) commented Nov 20, 2024

Thanks for the notes @LeeSmet. The challenges with TCP over TCP certainly make sense. That's also good to know about the throughput testing; I'll keep an eye out for any possible correlation in my charts.

Since my initial posts, things have started to look considerably worse. There's significantly increased packet loss and latency.

Here are some 30-hour charts for SSH probes to the first VM, comparing Mycelium performance (top) with regular IPv6 performance (bottom):

image
image

And same for the second VM:

image
image

The ping charts for these VMs look basically the same.

There's also a similar pattern in the ping charts to the public Mycelium nodes. For locales with two nodes, the charts are very similar, so I just show one of each:

DE 1 Myc
image
DE 1 IPv4
image
BE 3 Myc
image
BE 3 IPv4
image
FI 5 Myc
image
FI 5 IPv4
image
US EAST Myc
image
US EAST IPv4
image
US WEST Myc
image
US WEST IPv4
image
SG Myc
image
SG IPv4
image
IND Myc
image
IND IPv4
image

One note here: I'm not totally sure what the blank sections on the charts indicate. I think they mean that the probe command failed with an error rather than timing out. That would track with the fact that I'm often seeing "no route" and "network unreachable" messages when trying to connect to freshly deployed Zos VMs over Mycelium.

@iwanbk (Member) commented Nov 21, 2024

Hi @scottyeager @LeeSmet

I can reproduce it quite easily and can see the difference using ping.

I did three experiments; each of them uses only one public node, to simplify the test.

  1. Using the SG public node
  • It is the closest node to my place in Indonesia; the IPv4 ping time is ~20ms
  • It didn't work for me; even after 20 minutes I still got
    Destination unreachable: No route
  2. Using my own public node in Amsterdam
  • It is on DigitalOcean
  • Ping to another mycelium node in SG
  • Not much difference between the IPv4 ping time and the mycelium ping time
  • The IPv4 ping time is ~180-200ms
  3. Using the DE 1 public node
  • Ping from my PC at home to my VPS in SG, both using the DE 1 public node
  • I had to wait ~5-7 minutes for the mycelium ping to work
  • The mycelium ping time (600ms to over 1000ms) is almost always much higher than the IPv4 ping time (~400ms total)
    image

Considering that my IPv4 ping times to DE 1 and to my own Amsterdam node are very similar, I suspect that the DE 1 public node is overloaded.
I think we could check these three things:

  1. Check & improve the TCP over TCP issue
  • I read that one of the main issues is the RTO (retransmission timeout)
  • But how do we see whether we are hitting this issue?
  2. Whether mycelium is facing a scalability issue
  • Discussed with Lee: we indeed have some potential performance issues in the routing part. I don't think that relates to this issue, though.

  • I think I'll check the data path. @LeeSmet maybe some pointers?

  • Think about what we will do in case of an overloaded node, because it will eventually happen, or it might already be happening:

    • keep receiving new connections, which could worsen the condition
    • do rate limiting or reject new connections (see the sketch after this list):
      - clearer conditions from both the server & client side
      - avoids worsening the condition
      - the client will retry the connection and try other peers anyway
    • need to define "overloaded"
  3. Make a tool to stress test mycelium
  • So we can reproduce the issue more easily
  • Especially if we indeed find that it is a scalability issue
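A rough sketch of the "reject new connections when overloaded" idea; everything here is made up for illustration (none of these names exist in mycelium), and "overloaded" is stubbed out as a simple active-connection threshold:

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::net::{TcpListener, TcpStream};

const MAX_ACTIVE_PEERS: usize = 1024; // placeholder definition of "overloaded"

async fn accept_loop(listener: TcpListener, active: Arc<AtomicUsize>) -> std::io::Result<()> {
    loop {
        let (stream, addr) = listener.accept().await?;
        if active.load(Ordering::Relaxed) >= MAX_ACTIVE_PEERS {
            // Dropping the stream closes it right away: the client sees a clear
            // failure and can retry against another public peer.
            println!("rejecting {addr}: node overloaded");
            drop(stream);
            continue;
        }
        active.fetch_add(1, Ordering::Relaxed);
        let active = Arc::clone(&active);
        tokio::spawn(async move {
            handle_peer(stream).await;
            active.fetch_sub(1, Ordering::Relaxed);
        });
    }
}

async fn handle_peer(_stream: TcpStream) {
    // the normal peer handshake / data path would go here
}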

@iwanbk (Member) commented Nov 21, 2024

It would be good if we could do some profiling.

Is this the recommended way, @LeeSmet?
#141 (comment)

@iwanbk (Member) commented Nov 25, 2024

I did a bit of checking on the data paths:

  1. I remember there was a plan to do TUN offload: Implement TUN offloads #141.
    @LeeSmet what is the state of TUN offload? It looks like a good idea; I found that tailscale also does the same thing.

  2. I found that tokio-tun has send_vectored.
    @LeeSmet any reason why Linux uses send while macOS uses write_vectored (although it is not used in a correct way)?

  3. It is probably worth comparing rust-tun and tokio-tun.
    From a quick look, there doesn't seem to be much difference, but it is still worth a closer look.

@LeeSmet (Contributor) commented Nov 25, 2024

The TUN offload issue has been parked as it's purely an optimization for the send/receive nodes. It can be done in the future once the system is completely stable.

The initial implementation is as simple as possible: on Linux we just pull packets from the receiving channel and send them one by one. On macOS we use write_vectored, since packets need to be preceded by a header on the tun device and this seemingly can't be disabled (as we do on Linux), so write_vectored here is just a way to cheaply send this header in the same syscall.

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix. Note that there is a subtle difference between write_vectored and sending GSO-enabled packets. Also, for macOS this is slightly more cumbersome since every packet needs to be preceded by the header, so after getting a vector of packets from the channel we'd need to create a new vector where every even index is a packet header and every odd index is an actual packet.
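Roughly, building that interleaved vector for macOS could look something like the following; this is purely illustrative and not actual Mycelium code, and the header value assumes IPv6 on a Darwin utun device:

use std::io::IoSlice;

// macOS utun devices expect each packet to be preceded by a 4-byte protocol
// family header; 30 is AF_INET6 on Darwin.
const UTUN_HEADER_V6: [u8; 4] = [0, 0, 0, 30];

fn interleave_headers<'a>(packets: &'a [Vec<u8>]) -> Vec<IoSlice<'a>> {
    let mut slices = Vec::with_capacity(packets.len() * 2);
    for packet in packets {
        slices.push(IoSlice::new(&UTUN_HEADER_V6)); // even index: header
        slices.push(IoSlice::new(packet));          // odd index: packet payload
    }
    slices
}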

@iwanbk (Member) commented Nov 26, 2024

The TUN offload issue has been parked as it's purely an optimization for the send/receive nodes. It can be done in the future once the system is completely stable.

Ok, moved to the original issue at #141 (comment)

On macOS we use write_vectored, since packets need to be preceded by a header on the tun device and this seemingly can't be disabled (as we do on Linux), so write_vectored here is just a way to cheaply send this header in the same syscall.

Oh yes, I just remembered that you already told me about this, my bad.

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix.

Cool. Any concerns or things to note/check if I want to implement this?

@iwanbk (Member) commented Dec 2, 2024

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix.

I've tried this using the patch below; the change is indeed small.

Here are my observations when testing the performance using scp:

  • the performance depends on the maximum number of messages we want to receive with recv_many
  • if it is 1, the performance is similar
  • if it is more than 1, the performance is worse

I'm thinking about why it is worse:

  • the most obvious explanation is that my system is not busy enough: there is only a small chance that more than one message is ready to be consumed, so it ends up waiting for another message without result
  • recv_many is slow, as depicted above
  • send_vectored is indeed slow
diff --git a/mycelium/src/tun/linux.rs b/mycelium/src/tun/linux.rs
index 52b2b13..ccf5e9e 100644
--- a/mycelium/src/tun/linux.rs
+++ b/mycelium/src/tun/linux.rs
@@ -72,15 +72,35 @@ pub async fn new(
     // Spawn a single task to manage the TUN interface
     tokio::spawn(async move {
         let mut buf_hold = None;
+        let mut recv_vec = Vec::new();
+        let num = 2;
+        info!("IBK create recv_vec with num={}", num);
         loop {
             let mut buf = if let Some(buf) = buf_hold.take() {
                 buf
             } else {
                 PacketBuffer::new()
             };
-
+            use std::io::IoSlice;
             select! {
-                data = sink_receiver.recv() => {
+                num_buf = sink_receiver.recv_many(&mut recv_vec, num) => {
+                    if num_buf > 0 {
+                        //info!("data = {}", data);
+                         // Create IoSlice vector from PacketBuffers
+                         let io_slices: Vec<IoSlice> = recv_vec.iter()
+                         .map(|buf| IoSlice::new(&buf))
+                         .collect();
+
+                     if let Err(e) = tun.send_vectored(&io_slices).await {
+                         error!("Failed to send data to tun interface {e}");
+                     }
+                        recv_vec.clear(); // Clear for next use
+                    } else {
+                        return; // Channel closed
+                    }
+                    buf_hold = Some(buf);
+                }
+                /*data = sink_receiver.recv() => {
                     match data {
                         None => return,
                         Some(data) => {

@LeeSmet (Contributor) commented Dec 3, 2024

It's unlikely that the implementation of recv_many is slower than that of recv, and the fact that setting num to 1 gives similar performance seems to support this. There needs to be some detailed analysis with tools like perf/hotspot/samply/... to figure out where the function is spending its time. Note that tun.send_vectored requires IoSlices, and the current implementation creates these and collects them into a new vector, which means there is now an allocation on the hot path. If this is indeed an issue, it will be noticeable in a flamegraph.
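If that allocation does show up in a profile, one possible mitigation is to keep the per-batch IoSlice collection inline on the stack; this is just a sketch assuming the smallvec crate (not something Mycelium uses today):

use smallvec::SmallVec;
use std::io::IoSlice;

// Batches of up to 8 packets are stored inline, so the hot path doesn't touch the
// allocator for the common case; larger batches spill to the heap as before.
fn io_slices<'a>(packets: &'a [Vec<u8>]) -> SmallVec<[IoSlice<'a>; 8]> {
    packets.iter().map(|p| IoSlice::new(p)).collect()
}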

@iwanbk (Member) commented Dec 3, 2024

the most obvious explanation is that my system is not busy enough: there is only a small chance that more than one message is ready to be consumed, so it ends up waiting for another message without result
recv_many is slow, as depicted above

It's unlikely that the implementation of recv_many is slower than that of recv, and the fact that setting num to 1 gives similar performance seems to support this.

Yes, I think both statements could be true:

  • recv_many is probably only slower for systems that are not busy (in this case: a non-public node)
  • "It's unlikely that the implementation of recv_many is slower than that of recv" -> that is also true

Just like the current case #459 (comment):

  • using a private node is fast
  • using a public node is slow

There needs to be some detailed analysis with tools like perf/hotspot/samply/...

But the problem here is that results on a private node could be misleading, because IMO recv_many is really optimized for busy systems.
That is, unless we create a load-testing tool.

@iwanbk (Member) commented Dec 19, 2024

FYI, I discarded the work on the TUN side.
It was started because I misunderstood how the TUN device is used in mycelium.

Since Lee is working on the routing part, I'm checking the data path.
Some improvements were merged yesterday in #528.

Now I'm digging deeper into how the peer's data path works.
