Performance data reports #459
Here's another look at some of the data, this time for a bit more of a "real world" test. I'm also executing both ping and SSH probes against two VMs that are running on Zos nodes in the GreenEdge St Gallen farm. Anecdotally there seem to be intermittent issues with SSH connections to Zos VMs over Mycelium, so the intention here was to try to capture that in action.

Here are the SSH probe results via Mycelium to one VM over a ten day period, followed by the ping performance via Mycelium to the same VM for the same period. Next, ping performance over Mycelium to the four public nodes closest to St Gallen, and also pings to those public nodes over IPv4.

I'm not sure which nodes exactly are involved in routing to this VM. All I can see in the Mycelium logs is that the next hop for the route is the public node US West. Here's the ping graph for both Mycelium and IPv4 for that node. I suppose the traffic could also be traversing the US East node before crossing the pond, so I've included that one for good measure.

So the summary is that we can notice some periods where there is significant loss of SSH messages over Mycelium, which seem to have some correlation with ping packet loss over Mycelium as well.
There are quite a few things going on here, so I'll start by trying to explain some of the behaviors seen here.

The current Mycelium transports are TCP and QUIC reliable channels, which behave like TCP (ordered, reliable delivery with acknowledgements and congestion control). This is actually not the greatest for an overlay network, since the overlay assumes it is working on IP, which is by nature lossy. It would be better if we could use UDP, though right now plain UDP is not really feasible since we rely on the reliable semantics for the protocol messages. I have some work on a branch for using QUIC datagrams for the actual data, though that needs more tuning before it is useful. Packet loss in the underlay, when using these transports, generally translates to higher latencies due to retransmission, though it is interesting to see that packet loss is worse in the overlay.

Latency spikes are generally somewhat expected, since mycelium is a userspace process, whereas general network handling is done in the kernel (especially for ping). We are currently doing continuous network throughput testing; in general this also causes spikes in packets handled every 5-ish minutes, which translates to somewhat higher latency during these periods. For SSH sessions, part of this is likely due to TCP meltdown (the negative effect of packet loss with TCP-in-TCP, where both the overlay and underlay sessions independently try to recover the lost packets, leading to them fighting against each other).

In general, some more testing will have to be done in non-optimal network conditions (high latency with some minor packet loss) to see how the network behaves, but this is an interesting starting point.
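For illustration, here is a minimal sketch of the datagram idea, not the actual branch implementation. It assumes the quinn crate for QUIC and uses illustrative function names; the point is simply that a datagram lost in the underlay is dropped, like plain IP, instead of being retransmitted by the transport underneath the overlay.

```rust
use bytes::Bytes;
use quinn::{Connection, SendDatagramError};

/// Forward a single overlay packet as an unreliable QUIC datagram.
/// Unlike a reliable stream, a lost datagram is not retransmitted, so the
/// overlay sees ordinary IP-style loss instead of added latency.
fn forward_packet(conn: &Connection, packet: Bytes) -> Result<(), SendDatagramError> {
    // Datagrams must fit in a single QUIC packet; oversized overlay packets
    // would need fragmentation or a fallback to a reliable stream.
    if let Some(max) = conn.max_datagram_size() {
        if packet.len() > max {
            return Err(SendDatagramError::TooLarge);
        }
    }
    conn.send_datagram(packet)
}

/// Receive overlay packets from the peer as datagrams and hand them off.
async fn receive_packets(conn: Connection, mut handle_packet: impl FnMut(Bytes)) {
    // read_datagram resolves with an error once the connection is closed.
    while let Ok(datagram) = conn.read_datagram().await {
        handle_packet(datagram);
    }
}
```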
Thanks for the notes @LeeSmet. The challenges with TCP over TCP certainly make sense. That's also good to know about the throughput testing; I'll keep an eye out for any possible correlation with my charts.

Since my initial posts, things have started to look considerably worse, with significantly increased packet loss and latency. Here are some 30 hour charts for SSH probes to the first VM, comparing Mycelium performance (top) with regular IPv6 performance (bottom), and the same for the second VM. The ping charts for these VMs look basically the same. There's also a similar pattern on the ping charts to the public Mycelium nodes; for locales with two nodes the charts are very similar, so I just show one (DE 1 Myc).

One note here is that I'm not totally sure what the blank sections on the charts indicate. I think that means the probe command failed with an error, rather than a timeout. That would track with the fact that I'm often seeing "no route" and "network unreachable" messages when trying to connect to freshly deployed Zos VMs over Mycelium.
I can reproduce it quite easily and can see the difference. I did three experiments, each of them only using one public node to simplify the test:
Destination unreachable: No route
Considering that my IPv4 ping times to DE 1 and to my own Amsterdam node are very similar, I suspect that the DE 1 public node is overloaded.
It would be good if we could do profiling. Is this the recommended way, @LeeSmet?
I did a bit of checking on the data paths:
The TUN offload issue has been parked as it's purely an optimization for the send/receive nodes. It can be done in the future once the system is completely stable. The initial implementation is as simple as possible, so we just pull packets from the receiving channel one at a time and write them to the TUN device. We could use tokio's channel `recv_many` to batch this instead, as shown in the sketch below.
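A minimal sketch of that batching pattern, using a plain `Vec<u8>` as a stand-in for mycelium's packet buffer type (the names here are illustrative, not the actual ones) and assuming a tokio version that provides `recv_many`:

```rust
use tokio::sync::mpsc;

async fn drain_in_batches(mut rx: mpsc::Receiver<Vec<u8>>) {
    const BATCH: usize = 32;
    let mut batch = Vec::with_capacity(BATCH);
    loop {
        // Waits for at least one message, then also takes whatever else is
        // already queued, up to BATCH items; returns 0 only once the channel
        // is closed and drained.
        let n = rx.recv_many(&mut batch, BATCH).await;
        if n == 0 {
            return;
        }
        for packet in batch.drain(..) {
            // Write each packet out here, or build one vectored write from
            // the whole batch.
            let _ = packet;
        }
    }
}
```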
Ok, moved to the original issue at #141 (comment)
Oh yes, I just remembered that you already told me about this, my bad.
Cool.
I've tried this using the patch below; the change is indeed small. Here are my observations when testing the performance:
I'm thinking about why it's worse:
```diff
diff --git a/mycelium/src/tun/linux.rs b/mycelium/src/tun/linux.rs
index 52b2b13..ccf5e9e 100644
--- a/mycelium/src/tun/linux.rs
+++ b/mycelium/src/tun/linux.rs
@@ -72,15 +72,35 @@ pub async fn new(
// Spawn a single task to manage the TUN interface
tokio::spawn(async move {
let mut buf_hold = None;
+ let mut recv_vec = Vec::new();
+ let num = 2;
+ info!("IBK create recv_vec with num={}", num);
loop {
let mut buf = if let Some(buf) = buf_hold.take() {
buf
} else {
PacketBuffer::new()
};
-
+ use std::io::IoSlice;
select! {
- data = sink_receiver.recv() => {
+ num_buf = sink_receiver.recv_many(&mut recv_vec, num) => {
+ if num_buf > 0 {
+ //info!("data = {}", data);
+ // Create IoSlice vector from PacketBuffers
+ let io_slices: Vec<IoSlice> = recv_vec.iter()
+ .map(|buf| IoSlice::new(&buf))
+ .collect();
+
+ if let Err(e) = tun.send_vectored(&io_slices).await {
+ error!("Failed to send data to tun interface {e}");
+ }
+ recv_vec.clear(); // Clear for next use
+ } else {
+ return; // Channel closed
+ }
+ buf_hold = Some(buf);
+ }
+ /*data = sink_receiver.recv() => {
match data {
None => return,
Some(data) => {
```
It's unlikely that the implementation of recv_many is slower than that of recv, and the result when setting num to 1 seems to support this point. There needs to be some detailed analysis with tools like perf/hotspot/samply to figure out where the function is spending its time. Note that tun.send_vectored requires IoSlices, and the current implementation creates these and collects them into a new vector, which means there is now an allocation on the hot path. If this is indeed an issue, it will be noticeable in a flamegraph.
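If a flamegraph does confirm that this allocation matters, one small option is to collect the slices into inline stack storage instead. Below is a minimal sketch, assuming the smallvec crate as a dependency, with a generic `bufs` parameter standing in for the `recv_vec` of PacketBuffers from the patch above:

```rust
use std::io::IoSlice;

use smallvec::SmallVec;

/// Build the slices for a vectored write without a heap allocation for small
/// batches; the collection only spills to the heap if a batch ever grows past
/// the inline capacity of 32.
fn io_slices<B: AsRef<[u8]>>(bufs: &[B]) -> SmallVec<[IoSlice<'_>; 32]> {
    bufs.iter().map(|b| IoSlice::new(b.as_ref())).collect()
}
```

Whether this is worth doing at all depends on what the profile actually shows.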
Yes, yes, I think both statements could be true:
Just like the current case #459 (comment):
But the problem here is that the result on a private node could be misleading, because IMO
FYI, I discarded the work on TUN. Because Lee is working on the routing part, I'm checking the data path. Now I'm digging deeper into how the peer's data path works.
I've been collecting some data using SmokePing to get a sense for Mycelium performance from my perspective at home. A description of the methodology and everything needed to reproduce my approach is on this repo.
In summary, I'm connecting to all public Mycelium nodes via IPv4 TCP and pinging them periodically both over Mycelium and over IPv4. This was meant mostly to be a benchmark to evaluate other Mycelium hosts against, but it's revealed some trends that I think are worth highlighting. For most of the public nodes, I observe large variations in latency over Mycelium versus regular IPv4.
Here are some high level graphs, first showing the IPv4 ping performance to the public nodes. The line represents median ping time, with the "smoke" representing deviations from the median:
We can see that latency to these nodes is essentially flat, with occasional minor deviations. This sample is representative of the data I've collected so far.
Here's the same view, but for pings sent over Mycelium:
Sometimes the behavior over Mycelium seems to be related to an issue also seen on regular IPv4, but sometimes not. Here's an example from the SG node of a rather substantial latency spike on Mycelium:
But IPv4 looks rather clean over the same time period:
Here's a case where the issue observed on IPv4 seems to be amplified over Mycelium. Here we see relatively high packet loss in purple and pink:
Versus relatively low packet loss over IPv4 at the same time:
Here's a longer window, showing the large latency swings and a period of packet loss on Mycelium:
Versus IPv4:
So what I see overall is that median latency over Mycelium can vary by 100% hour to hour for directly connected public peers, while the medians over IPv4 to the same nodes tend to vary by no more than 5-10%.
It also appears that small amounts of packet loss on the underlay network get amplified into larger packet loss over Mycelium.
However, the latency to my closest public peer, US West, is much more stable. That could of course be a coincidence, which could be cleared up by running the same test from different locations.
It's possible that latency is shifting along with load on the public nodes, and perhaps strain on the Mycelium process could cause these results. I don't have the visibility to say whether that's happening, but it doesn't seem likely given the project's current level of exposure, and given the lack of clear correlation between nodes during times of high latency.