
Implement TUN offloads #141

Open
LeeSmet opened this issue Feb 28, 2024 · 7 comments
Labels
type_feature New feature or request

Comments

@LeeSmet
Contributor

LeeSmet commented Feb 28, 2024

Performance profiles show that the biggest amount of time is currently spent in 3 places:

  • read from tun
  • write to tun
  • write to peers

One option which could improve the situation is #102, since larger packets naturally mean fewer syscalls (in the case of a TCP stream in the overlay). The problem there is that larger packets will need to be fragmented if the lower-layer link has a smaller MTU (which is the reason why the MTU is currently set at 1400). While we currently only use stream-based connections, keeping individual packets at MTU 1400 leaves the door open for (plain) UDP at some point.

The proper way to handle this instead would be to enable TSO (and USO and GRO, while we are at it). Unfortunately, not a lot of info is readily available about this. In a first stage, we'll limit this to Linux. From what I did manage to find so far:

  • Reading from the tun can produce a packet bigger than MTU (and similarly a bigger packet can be written)
  • Checksum offloading is required to be enabled.
  • We can implement ioctls on the tun created by the library, so we don't really have to write the tun code from scratch.
  • Ideally the vnet_hdr is enabled (this can be done at startup, which requires library changes, or with an ioctl later, which seems to be the preferred path for now). This puts a vnet_hdr struct with offload info at the start of the packet.
  • Since the segmentation boundary is defined in the packet, we can't do a vectored write into a bunch of packetbuffers. Instead we'll need to allocate a large buffer first, from which we then copy the data.
  • We can allocate this buffer once and reuse it.
  • Since we already have a single read/write loop setup, we can also reuse this buffer for both GRO and TSO/USO.
  • In theory, we can send unsegmented packets with the leading header to peers (if I understand this correctly), but then peers won't be able to handle the packet if they don't have offloading. So packets must be segmented before sending, which also makes this backward compatible with legacy code.
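To make the vnet_hdr point concrete, here is a sketch (not mycelium code) of the 10-byte `struct virtio_net_hdr` from `linux/virtio_net.h` that prefixes every packet once the header is enabled. The field layout and constants are the kernel's; the parse helper and its little-endian assumption (native byte order on x86) are mine:

```rust
/// Flag and GSO-type values from linux/virtio_net.h.
const VIRTIO_NET_HDR_F_NEEDS_CSUM: u8 = 1;
const VIRTIO_NET_HDR_GSO_TCPV4: u8 = 1;

/// The 10-byte header the kernel prepends to packets read from a
/// TUN device opened with IFF_VNET_HDR.
#[derive(Debug, PartialEq)]
struct VnetHdr {
    flags: u8,       // e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM
    gso_type: u8,    // e.g. VIRTIO_NET_HDR_GSO_TCPV4
    hdr_len: u16,    // length of headers to replicate into each segment
    gso_size: u16,   // payload bytes per segment
    csum_start: u16, // offset where checksumming starts
    csum_offset: u16, // offset (from csum_start) of the checksum field
}

impl VnetHdr {
    const SIZE: usize = 10;

    /// Parse the header from the start of a buffer read from the TUN fd.
    fn parse(buf: &[u8]) -> Option<VnetHdr> {
        if buf.len() < Self::SIZE {
            return None;
        }
        Some(VnetHdr {
            flags: buf[0],
            gso_type: buf[1],
            hdr_len: u16::from_le_bytes([buf[2], buf[3]]),
            gso_size: u16::from_le_bytes([buf[4], buf[5]]),
            csum_start: u16::from_le_bytes([buf[6], buf[7]]),
            csum_offset: u16::from_le_bytes([buf[8], buf[9]]),
        })
    }
}
```

The payload then follows the header in the same buffer, which is why a single large reusable read buffer (rather than a vectored read into packet buffers) is the natural fit.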
@iwanbk
Member

iwanbk commented Apr 25, 2024

Performance profiles show that the biggest amount of time is currently spent in 3 places:

@LeeSmet

Curious, how did you do the profiling?

@LeeSmet
Contributor Author

LeeSmet commented Apr 25, 2024

In my global cargo config I have a section which specifies a profiling profile, which just adds debug symbols to the configured release profile of the project:

[profile.profiling]
inherits = "release"
debug = true

Then I build with cargo build --profile profiling. This binary is then run under samply (sudo -E samply record ./target/profiling/mycelium {args}). The resulting profile can then be inspected (it uses the Firefox Profiler UI by default) to see where the application spends its time.

@iwanbk
Member

iwanbk commented Apr 25, 2024

nice 👍

@iwanbk
Member

iwanbk commented Nov 26, 2024

continuing discussion on #459 (comment)

The TUN offload issue has been parked as it's purely an optimization for the sending/receiving nodes. It can be done in the future once the system is completely stable.

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

From a quick search, to enable GRO we only need to enable the flag using ethtool, CMIIW

@iwanbk
Member

iwanbk commented Dec 3, 2024

From a quick search, to enable GRO we only need to enable the flag using ethtool, CMIIW

It is not that simple.

I found a blog post from Tailscale; they have already done this and the results are good:

https://tailscale.com/blog/more-throughput doesn't have the details but it provides some clue.
I think we can start from there.

@LeeSmet
Contributor Author

LeeSmet commented Dec 3, 2024

Enabling GSO/GRO indeed requires setting some flags. The main difficulty is in handling the larger packets we then receive. Effectively, we will receive packets larger than the MTU of the interface, and we become responsible for splitting them. This means we also become responsible for populating the headers, most notably the checksum, on the sending side. Sending is the easier part. On the receiving side, we need to reassemble the chunked packets into a larger packet again (not strictly necessary, since the chunked packets must be valid packets themselves; after all, we don't know if the target host has GSO enabled). But since packets can be mixed (perhaps we are receiving packets from 2 different sources), we need to meticulously select packets for reassembly by keeping track of the flows we are receiving.
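As a rough illustration of the splitting side (a hypothetical helper, not project code): real segmentation must also replicate the IP/TCP headers into every segment and fix up lengths, sequence numbers, and checksums, but the payload slicing itself is simply driven by the gso_size field from the vnet header:

```rust
/// Split a GSO super-packet's payload into gso_size-sized chunks.
/// NOTE: this is only the slicing step. A real implementation must
/// additionally prepend the (fixed-up) protocol headers to each
/// chunk to turn it into a valid on-the-wire packet.
fn segment_payload(payload: &[u8], gso_size: usize) -> Vec<&[u8]> {
    assert!(gso_size > 0, "gso_size must be non-zero");
    // chunks() yields full-size slices plus one shorter tail slice.
    payload.chunks(gso_size).collect()
}
```

Because the slices borrow from the original read buffer, the large reusable buffer mentioned earlier can back all segments without extra copies until they are written out.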

All of this is further complicated by the fact that the kernel documentation is pretty lackluster on how to actually implement it.

My idea is to start by creating our own TUN crate to the point we can replace the existing ones we use, and then implement this functionality.

Note that this is strictly a performance improvement for the sending and receiving node.

Also note that, for performance, we'd want to do checksum calculations with vector instructions. This should not be too hard: in Rust we can basically write the calculation function several times, annotate the copies with different vector extensions so we can rely on LLVM auto-vectorization, and then do conditional function selection in the main function.

continuing discussion on #459 (comment)

The TUN offload issue has been parked as it's purely an optimization for the sending/receiving nodes. It can be done in the future once the system is completely stable.

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

The things we observe in that issue are imo not related to the performance of sending/receiving packets. In local tests (2 nodes on the same machine) I can push 2.5 Gbit (IIRC) on my laptop, and 4 Gbit at home.

@iwanbk
Member

iwanbk commented Dec 3, 2024

Wdyt about implementing it now @LeeSmet ?
I think we already have performance/scalability issue.

The things we observe in that issue are imo not related to the performance of sending/receiving packets. In local tests (2 nodes on the same machine) I can push 2.5 Gbit (IIRC) on my laptop, and 4 Gbit at home.

Actually, it is not necessarily about solving that issue.
It is only to point out that we might already have reached the point where scalability and performance should become our priority.
