
Implement TUN offloads #141

Open
LeeSmet opened this issue Feb 28, 2024 · 7 comments
Labels
type_feature New feature or request

Comments

@LeeSmet
Contributor

LeeSmet commented Feb 28, 2024

Performance profiles show that the biggest amount of time is currently spent in 3 places:

  • read from tun
  • write to tun
  • write to peers

One option which could improve the situation is #102, since larger packets naturally mean fewer syscalls (in the case of a TCP stream in the overlay). The problem there is that larger packets will need to be fragmented if the lower-layer link has a smaller MTU (which is the reason why the MTU is currently set at 1400). While we currently only use stream-based connections, keeping individual packets at MTU 1400 leaves the door open for (plain) UDP at some point.

The proper way to handle this instead would be to enable TSO (and USO and GRO, while we are at it). Unfortunately, not a lot of info is readily available about this. In a first stage, we'll limit this to Linux. From what I did manage to find so far:

  • Reading from the tun can produce a packet bigger than MTU (and similarly a bigger packet can be written)
  • Checksum offloading is required to be enabled.
  • We can implement ioctls on the tun created by the library, so we don't really have to write the tun code from scratch.
  • Ideally the vnet_hdr is enabled (this can be done at startup, which requires library changes, or with an ioctl later, which seems to be the preferred path for now). This puts a vnet_hdr struct with offload info at the start of the packet.
  • Since the segmentation boundary is defined in the packet, we can't do a vectored write into a bunch of packetbuffers. Instead we'll need to allocate a large buffer first, from which we then copy the data.
  • We can allocate this buffer once and reuse it.
  • Since we already have a single read/write loop setup, we can also reuse this buffer for both GRO and TSO/USO.
  • In theory, we can send unsegmented packets with the leading header to peers (if I understand this correctly), but then peers won't be able to handle the packet if they don't have offloading. So packets must be segmented before sending, which also makes this backward compatible with legacy code.
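To make the vnet_hdr point concrete, here is a sketch (not mycelium code) of the 10-byte `struct virtio_net_hdr` from `linux/virtio_net.h` that prefixes every packet once the header is enabled. The field layout and constants are the kernel's; the parse helper and its little-endian assumption (native byte order on x86) are mine:

```rust
/// Flag and GSO-type values from linux/virtio_net.h.
const VIRTIO_NET_HDR_F_NEEDS_CSUM: u8 = 1;
const VIRTIO_NET_HDR_GSO_TCPV4: u8 = 1;

/// The 10-byte header the kernel prepends to packets read from a
/// TUN device opened with IFF_VNET_HDR.
#[derive(Debug, PartialEq)]
struct VnetHdr {
    flags: u8,       // e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM
    gso_type: u8,    // e.g. VIRTIO_NET_HDR_GSO_TCPV4
    hdr_len: u16,    // length of headers to replicate into each segment
    gso_size: u16,   // payload bytes per segment
    csum_start: u16, // offset where checksumming starts
    csum_offset: u16, // offset (from csum_start) of the checksum field
}

impl VnetHdr {
    const SIZE: usize = 10;

    /// Parse the header from the start of a buffer read from the TUN fd.
    fn parse(buf: &[u8]) -> Option<VnetHdr> {
        if buf.len() < Self::SIZE {
            return None;
        }
        Some(VnetHdr {
            flags: buf[0],
            gso_type: buf[1],
            hdr_len: u16::from_le_bytes([buf[2], buf[3]]),
            gso_size: u16::from_le_bytes([buf[4], buf[5]]),
            csum_start: u16::from_le_bytes([buf[6], buf[7]]),
            csum_offset: u16::from_le_bytes([buf[8], buf[9]]),
        })
    }
}
```

The payload then follows the header in the same buffer, which is why a single large reusable read buffer (rather than a vectored read into packet buffers) is the natural fit.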
@iwanbk
Member

iwanbk commented Apr 25, 2024

Performance profiles show that the biggest amount of time is currently spent in 3 places:

@LeeSmet

Curious, how did you do the profiling?

@LeeSmet
Contributor Author

LeeSmet commented Apr 25, 2024

In my global cargo config I have a section which specifies a profiling profile, which just adds debug symbols to the configured release profile of the project:

[profile.profiling]
inherits = "release"
debug = true

Then I build with cargo build --profile profiling. This binary is then run under samply (sudo -E samply record ./target/profiling/mycelium {args}). The resulting profile can then be inspected (it uses the Firefox Profiler UI by default) to see where the application spends its time.

@iwanbk
Member

iwanbk commented Apr 25, 2024

nice 👍

@iwanbk
Member

iwanbk commented Nov 26, 2024

continuing discussion on #459 (comment)

The TUN offload issue has been parked as it's purely an optimization for the sending/receiving nodes. It can be done in the future once the system is completely stable.

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

From a quick search, to enable GRO we only need to enable the flag using ethtool, CMIIW

@iwanbk
Member

iwanbk commented Dec 3, 2024

From a quick search, to enable GRO we only need to enable the flag using ethtool, CMIIW

It is not that simple.

I found a blog post from Tailscale; they have already done this and the results are good:

https://tailscale.com/blog/more-throughput doesn't have the details but it provides some clue.
I think we can start from there.

@LeeSmet
Contributor Author

LeeSmet commented Dec 3, 2024

Enabling GSO/GRO indeed requires setting some flags. The main difficulty is in handling the larger packets we then receive. Effectively, we will receive packets larger than the MTU of the interface, and we become responsible for splitting them. This means we also become responsible for populating the headers, most notably the checksum, on the sending side. Sending is the easier part. On the receiving side, we need to reassemble the chunked packets into a larger packet again (not strictly necessary, since the chunked packets must be valid packets themselves; after all, we don't know if the target host has GSO enabled). But since packets can be mixed (perhaps we are receiving packets from 2 different sources), we need to meticulously select packets for reassembly by keeping track of the flows we are receiving.
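As a rough illustration of the splitting side (a hypothetical helper, not project code): real segmentation must also replicate the IP/TCP headers into every segment and fix up lengths, sequence numbers, and checksums, but the payload slicing itself is simply driven by the gso_size field from the vnet header:

```rust
/// Split a GSO super-packet's payload into gso_size-sized chunks.
/// NOTE: this is only the slicing step. A real implementation must
/// additionally prepend the (fixed-up) protocol headers to each
/// chunk to turn it into a valid on-the-wire packet.
fn segment_payload(payload: &[u8], gso_size: usize) -> Vec<&[u8]> {
    assert!(gso_size > 0, "gso_size must be non-zero");
    // chunks() yields full-size slices plus one shorter tail slice.
    payload.chunks(gso_size).collect()
}
```

Because the slices borrow from the original read buffer, the large reusable buffer mentioned earlier can back all segments without extra copies until they are written out.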

All of this is further complicated by the fact that the kernel documentation is pretty lackluster on how to actually implement it.

My idea is to start by creating our own TUN crate to the point we can replace the existing ones we use, and then implement this functionality.

Note that this is strictly a performance improvement for the sending and receiving node.

Also note that, for performance, we'd want to do checksum calculations with vector instructions. This should not be too hard: in Rust we can basically write the calculation function several times, annotate the copies with different vector extensions so we can rely on LLVM auto-vectorization, and then do conditional function selection in the main function.

continuing discussion on #459 (comment)

The TUN offload issue has been parked as it's purely an optimization for the sending/receiving nodes. It can be done in the future once the system is completely stable.

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

The things we observe in that issue are imo not related to the performance of sending/receiving packets. In local tests (2 nodes on the same machine) I can push 2.5 Gbit (IIRC) on my laptop, and 4 Gbit at home.

@iwanbk
Member

iwanbk commented Dec 3, 2024

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

The things we observe in that issue are imo not related to the performance of sending/receiving packets. In local tests (2 nodes on the same machine) I can push 2.5 Gbit (IIRC) on my laptop, and 4 Gbit at home.

Actually, it is not necessarily about solving that issue.
It is only to point out that we might already have reached the point where scalability and performance should become our priority.
