Performance data reports #459

Open
scottyeager opened this issue Sep 24, 2024 · 12 comments

I've been collecting some data using SmokePing to get a sense of Mycelium performance from my vantage point at home. A description of the methodology and everything needed to reproduce my approach is in this repo.

In summary, I'm connecting to all public Mycelium nodes via IPv4 TCP and pinging them periodically both over Mycelium and over IPv4. This was meant mostly to be a benchmark to evaluate other Mycelium hosts against, but it's revealed some trends that I think are worth highlighting. For most of the public nodes, I observe large variations in latency over Mycelium versus regular IPv4.

Here are some high level graphs, first showing the IPv4 ping performance to the public nodes. The line represents median ping time, with the "smoke" representing deviations from the median:

image

We can see that latency to these nodes is basically flat, with occasional minor deviations. This sample is representative of the data I've collected so far.

Here's the same view, but for pings sent over Mycelium:

image

Sometimes the behavior over Mycelium seems to be related to an issue also seen on regular IPv4, but sometimes not. Here's an example from the SG node of a rather substantial latency spike on Mycelium:

image

But IPv4 looks rather clean over the same time period:

image

Here's a case where an issue observed on IPv4 seems to be amplified over Mycelium. We see relatively high packet loss in purple and pink:

image

Versus relatively low packet loss over IPv4 at the same time:

image

Here's a longer window, showing the large latency swings and a period of packet loss on Mycelium:

image

Versus IPv4:

image

So what I see overall is that median latency over Mycelium can vary by 100% hour to hour for directly connected public peers, while the medians over IPv4 to the same nodes tend to vary by no more than 5-10%.

It also appears that small amounts of packet loss on the underlay network get amplified into larger packet loss over Mycelium.

However, the latency to my closest public peer, US West, is much more stable. That could of course be a coincidence, which could be cleared up by running the same test from different locations.

It's possible that latency is shifting along with load on the public nodes, and perhaps strain on the Mycelium process could cause these results. I don't have the visibility to say whether that's happening, but it doesn't seem likely given the project's current level of exposure and the lack of clear correlation between nodes during periods of high latency.

@scottyeager (Author)

Here's another look at some of the data, this time with a bit more of a "real world" test. I'm also executing both ping and SSH probes against two VMs running on Zos nodes in the GreenEdge St Gallen farm. Anecdotally there seem to be intermittent issues with SSH connections to Zos VMs over Mycelium, so the intention here was to try to capture that in action.

Here's the SSH probe results via Mycelium to one VM over a ten day period:

image

Here's the ping performance via Mycelium to the same VM for the same period:

image

Next I'll show ping performance over Mycelium to the four public nodes closest to St Gallen:

image
image
image
image

Also pings to those public nodes over IPv4:

image
image
image
image

I'm not sure which nodes exactly are involved in routing to this VM. All I can see in the Mycelium logs is that the next hop for the route is the public node US West. Here's the ping graph for both Mycelium and IPv4 for that node:

image
image

I suppose the traffic could also be traversing the US East node before crossing the pond, so for good measure:

image
image

In summary, there are periods of significant loss of SSH messages over Mycelium, and these appear to correlate with ping packet loss over Mycelium as well.

scottyeager changed the title from "Latency spikes to directly connected peers" to "Performance data reports" on Oct 2, 2024
@LeeSmet (Contributor) commented Oct 14, 2024

There are quite a few things going on here, so I'll start by trying to explain some of the behaviors seen. The current Mycelium transports are TCP and QUIC reliable channels, which behave like TCP (ordered, reliable delivery with acknowledgements and congestion control). This is actually not great for an overlay network, since the overlay assumes it is working on IP, which is by nature lossy. It would be better if we could use UDP, though right now plain UDP is not really feasible since we rely on the reliable semantics for the protocol messages. I have some work on a branch for using QUIC datagrams for the actual data, though that needs more tuning before it is useful.
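For reference, here is a rough sketch of what that datagram path could look like, assuming the quinn crate; this is illustrative only and not the actual branch code:

use bytes::Bytes;
use quinn::Connection;

// Forward an overlay packet as an unreliable QUIC datagram: no retransmission,
// no ordering, no head-of-line blocking, so a lost packet is simply lost, like plain IP.
fn forward_packet(conn: &Connection, packet: Bytes) {
    if let Err(e) = conn.send_datagram(packet) {
        // Fails e.g. if the peer disabled datagram support or the packet exceeds max_datagram_size().
        eprintln!("failed to send datagram: {e}");
    }
}

// Receive raw packets from the peer and hand them to the router / TUN writer.
async fn receive_packets(conn: &Connection) {
    while let Ok(packet) = conn.read_datagram().await {
        println!("received {} byte packet", packet.len());
    }
}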

Packet loss in the underlay, when using these transports, generally translates to higher latencies due to retransmission, though it is interesting to see that packet loss is worse in the overlay.

Latency spikes are generally somewhat expected, since Mycelium is a userspace process, whereas general network handling (especially for ping) is done in the kernel. We are also currently doing continuous network throughput testing; in general this causes spikes in packets handled every 5 minutes or so, which translates to somewhat higher latency during these periods.

For SSH sessions, part of this is likely due to TCP meltdown (the negative effect of packet loss with TCP-in-TCP, where both the overlay and underlay sessions independently try to recover the lost packets, ending up fighting each other).

In general, some more testing will have to be done under non-optimal network conditions (high latency with some minor packet loss) to see how the network behaves, but this is an interesting starting point.

iwanbk self-assigned this on Nov 20, 2024
@scottyeager (Author) commented Nov 20, 2024

Thanks for the notes @LeeSmet. The challenges with TCP over TCP certainly make sense. That's also good to know about the throughput testing; I'll keep an eye out for any possible correlation in my charts.

Since my initial posts, things have started to look considerably worse. There's significantly increased packet loss and latency.

Here are some 30-hour charts for SSH probes to the first VM, comparing Mycelium performance (top) with regular IPv6 performance (bottom):

image
image

And same for the second VM:

image
image

The ping charts for these VMs look basically the same.

There's also a similar pattern in the ping charts to the public Mycelium nodes. For locales with two nodes, the charts are very similar, so I just show one of each:

DE 1 Myc
image
DE 1 IPv4
image
BE 3 Myc
image
BE 3 IPv4
image
FI 5 Myc
image
FI 5 IPv4
image
US EAST Myc
image
US EAST IPv4
image
US WEST Myc
image
US WEST IPv4
image
SG Myc
image
SG IPv4
image
IND Myc
image
IND IPv4
image

One note here: I'm not totally sure what the blank sections on the charts indicate. I think they mean that the probe command failed with an error rather than timing out. That would track with the fact that I'm often seeing "no route" and "network unreachable" messages when trying to connect to freshly deployed Zos VMs over Mycelium.

@iwanbk (Member) commented Nov 21, 2024

Hi @scottyeager @LeeSmet

I can reproduce it quite easily and can see the difference using ping.

I did three experiments; each of them uses only one public node, to simplify the test.

  1. Using the SG public node
  • It is the closest node to my place in Indonesia; the IPv4 ping time is ~20ms
  • It didn't work for me; even after 20 minutes I still got
    Destination unreachable: No route
  2. Using my own public node in Amsterdam
  • It is on DigitalOcean
  • Ping to another mycelium node in SG
  • Not much difference between the IPv4 ping time and the mycelium ping time
  • The IPv4 ping time is ~180-200ms
  3. Using the DE 1 public node
  • Ping from my PC at home to my VPS in SG, both using the DE 1 public node
  • I had to wait ~5-7 minutes for the mycelium ping to work
  • The mycelium ping time (600ms to over 1000ms) is almost always much higher than the IPv4 ping time (~400ms total)
    image

Considering that my IPv4 ping times to DE 1 and to my own Amsterdam node are very similar, I suspect that the DE 1 public node is overloaded.
I think we could check these three things:

  1. Check & improve the TCP over TCP issue
  • I read that one of the main issues is the RTO (retransmission timeout)
  • But how do we see whether we are hitting this issue?
  2. Whether mycelium is facing a scalability issue
  • Discussed with Lee: we indeed have some potential performance issues in the routing part. I don't think that relates to this issue, though.

  • I think I'll check the data path. @LeeSmet maybe some pointers?

  • Think about what we will do in case of an overloaded node, because it will eventually happen, or it might already be happening:

    • keep receiving new connections, which could worsen the condition
    • do rate limiting or reject new connections (see the sketch after this list):
      - clearer conditions from both the server & client side
      - avoids worsening the condition
      - the client will retry the connection and try other peers anyway
    • need to define "overloaded"
  3. Make a tool to stress test mycelium
  • So we can reproduce the issue more easily
  • Especially if we indeed find that it is a scalability issue
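A rough sketch of the "reject new connections when overloaded" idea; everything here is made up for illustration (none of these names exist in mycelium), and "overloaded" is stubbed out as a simple active-connection threshold:

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::net::{TcpListener, TcpStream};

const MAX_ACTIVE_PEERS: usize = 1024; // placeholder definition of "overloaded"

async fn accept_loop(listener: TcpListener, active: Arc<AtomicUsize>) -> std::io::Result<()> {
    loop {
        let (stream, addr) = listener.accept().await?;
        if active.load(Ordering::Relaxed) >= MAX_ACTIVE_PEERS {
            // Dropping the stream closes it right away: the client sees a clear
            // failure and can retry against another public peer.
            println!("rejecting {addr}: node overloaded");
            drop(stream);
            continue;
        }
        active.fetch_add(1, Ordering::Relaxed);
        let active = Arc::clone(&active);
        tokio::spawn(async move {
            handle_peer(stream).await;
            active.fetch_sub(1, Ordering::Relaxed);
        });
    }
}

async fn handle_peer(_stream: TcpStream) {
    // the normal peer handshake / data path would go here
}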

@iwanbk (Member) commented Nov 21, 2024

It would be good if we could do some profiling.

Is this the recommended way, @LeeSmet?
#141 (comment)

@iwanbk (Member) commented Nov 25, 2024

I did a bit of checking on the data paths:

  1. I remember there was a plan to do TUN offload: Implement TUN offloads #141.
    @LeeSmet what is the state of TUN offload? It looks like a good idea; I found that tailscale also does the same thing.

  2. I found that tokio-tun has send_vectored.
    @LeeSmet any reason why Linux uses send while macOS uses write_vectored (although it is not used in a correct way)?

  3. It is probably worth comparing rust-tun and tokio-tun.
    From a quick look, there doesn't seem to be much difference, but it is still worth a closer look.

@LeeSmet (Contributor) commented Nov 25, 2024

The TUN offload issue has been parked as it's purely an optimization for the send/receive nodes. It can be done in the future once the system is completely stable.

The initial implementation is as simple as possible: on Linux we just pull packets from the receiving channel and send them one by one. On macOS we use write_vectored, since packets need to be preceded by a header on the tun device and this seemingly can't be disabled (as we do on Linux), so write_vectored here is just a way to cheaply send this header in the same syscall.

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix. Note that there is a subtle difference between write_vectored and sending GSO-enabled packets. Also, for macOS this is slightly more cumbersome since every packet needs to be preceded by the header, so after getting a vector of packets from the channel we'd need to create a new vector where every even index is a packet header and every odd index is an actual packet.
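Roughly, building that interleaved vector for macOS could look something like the following; this is purely illustrative and not actual Mycelium code, and the header value assumes IPv6 on a Darwin utun device:

use std::io::IoSlice;

// macOS utun devices expect each packet to be preceded by a 4-byte protocol
// family header; 30 is AF_INET6 on Darwin.
const UTUN_HEADER_V6: [u8; 4] = [0, 0, 0, 30];

fn interleave_headers<'a>(packets: &'a [Vec<u8>]) -> Vec<IoSlice<'a>> {
    let mut slices = Vec::with_capacity(packets.len() * 2);
    for packet in packets {
        slices.push(IoSlice::new(&UTUN_HEADER_V6)); // even index: header
        slices.push(IoSlice::new(packet));          // odd index: packet payload
    }
    slices
}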

@iwanbk (Member) commented Nov 26, 2024

The TUN offload issue has been parked as it's purely an optimization for the send/receive nodes. It can be done in the future once the system is completely stable.

Ok, moved to the original issue at #141 (comment)

On macOS we use write_vectored, since packets need to be preceded by a header on the tun device and this seemingly can't be disabled (as we do on Linux), so write_vectored here is just a way to cheaply send this header in the same syscall.

Oh yes, I just remembered that you already told me about this, my bad.

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix.

Cool. Any concerns or things to note/check if I want to implement this?

@iwanbk (Member) commented Dec 2, 2024

For now, we could use tokio's channel recv_many on Linux followed by a write_vectored for some performance improvement; that should be a quick and easy fix.

I've tried this using the patch below; the change is indeed small.

Here are my observations when testing the performance using scp:

  • the performance depends on the maximum number of messages we want to receive with recv_many
  • if it is 1, the performance is similar
  • if it is more than 1, the performance is worse

I'm thinking about why it is worse:

  • the most obvious explanation is that my system is not busy enough: there is only a small chance that more than one message is ready to be consumed, so it ends up waiting for another message without result
  • recv_many is slow, as depicted above
  • send_vectored is indeed slow
diff --git a/mycelium/src/tun/linux.rs b/mycelium/src/tun/linux.rs
index 52b2b13..ccf5e9e 100644
--- a/mycelium/src/tun/linux.rs
+++ b/mycelium/src/tun/linux.rs
@@ -72,15 +72,35 @@ pub async fn new(
     // Spawn a single task to manage the TUN interface
     tokio::spawn(async move {
         let mut buf_hold = None;
+        let mut recv_vec = Vec::new();
+        let num = 2;
+        info!("IBK create recv_vec with num={}", num);
         loop {
             let mut buf = if let Some(buf) = buf_hold.take() {
                 buf
             } else {
                 PacketBuffer::new()
             };
-
+            use std::io::IoSlice;
             select! {
-                data = sink_receiver.recv() => {
+                num_buf = sink_receiver.recv_many(&mut recv_vec, num) => {
+                    if num_buf > 0 {
+                        //info!("data = {}", data);
+                         // Create IoSlice vector from PacketBuffers
+                         let io_slices: Vec<IoSlice> = recv_vec.iter()
+                         .map(|buf| IoSlice::new(&buf))
+                         .collect();
+
+                     if let Err(e) = tun.send_vectored(&io_slices).await {
+                         error!("Failed to send data to tun interface {e}");
+                     }
+                        recv_vec.clear(); // Clear for next use
+                    } else {
+                        return; // Channel closed
+                    }
+                    buf_hold = Some(buf);
+                }
+                /*data = sink_receiver.recv() => {
                     match data {
                         None => return,
                         Some(data) => {

@LeeSmet (Contributor) commented Dec 3, 2024

It's unlikely that the implementation of recv_many is slower than that of recv, and the fact that setting num to 1 gives similar performance seems to support this. There needs to be some detailed analysis with tools like perf/hotspot/samply/... to figure out where the function is spending its time. Note that tun.send_vectored requires IoSlices, and the current implementation creates these and collects them into a new vector, which means there is now an allocation on the hot path. If this is indeed an issue, it will be noticeable in a flamegraph.
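If that allocation does show up in a profile, one possible mitigation is to keep the per-batch IoSlice collection inline on the stack; this is just a sketch assuming the smallvec crate (not something Mycelium uses today):

use smallvec::SmallVec;
use std::io::IoSlice;

// Batches of up to 8 packets are stored inline, so the hot path doesn't touch the
// allocator for the common case; larger batches spill to the heap as before.
fn io_slices<'a>(packets: &'a [Vec<u8>]) -> SmallVec<[IoSlice<'a>; 8]> {
    packets.iter().map(|p| IoSlice::new(p)).collect()
}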

@iwanbk (Member) commented Dec 3, 2024

the most obvious explanation is that my system is not busy enough: there is only a small chance that more than one message is ready to be consumed, so it ends up waiting for another message without result
recv_many is slow, as depicted above

It's unlikely that the implementation of recv_many is slower than that of recv, and the fact that setting num to 1 gives similar performance seems to support this.

Yes, I think both statements could be true:

  • recv_many is probably only slower for systems that are not busy (in this case: a non-public node)
  • "It's unlikely that the implementation of recv_many is slower than that of recv" -> that is also true

Just like the current case #459 (comment):

  • using a private node is fast
  • using a public node is slow

There needs to be some detailed analysis with tools like perf/hotspot/samply/...

But the problem here is that results on a private node could be misleading, because IMO recv_many is really optimized for busy systems.
That is, unless we create a load-testing tool.

@iwanbk (Member) commented Dec 19, 2024

FYI, I discarded the work on the TUN side.
It was started because I misunderstood how the TUN device is used in mycelium.

Since Lee is working on the routing part, I'm checking the data path.
Some improvements were merged yesterday in #528.

Now I'm digging deeper into how the peer's data path works.
