Implement TUN offloads #141
Comments
Curious, how did you do the profiling?
In my global cargo config I have a section which specifies:

```toml
[profile.profiling]
inherits = "release"
debug = true
```

Then I build with
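Presumably that refers to Cargo's named-profile flag; the exact command isn't quoted above, so this is a sketch of the usual invocation rather than the commenter's actual one:

```sh
# Build with the custom profile defined above; the output lands in target/profiling/
cargo build --profile profiling
```

The resulting binary is optimized like a release build but keeps debug info, which is what profilers such as perf or samply need to show readable symbol names.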
nice 👍
continuing discussion on #459 (comment)
Wdyt about implementing it now @LeeSmet? From a quick search, to enable GRO we only need to enable the flag using ethtool, CMIIW.
It is not that simple. I found a blog post from Tailscale; they already did this and the results are good: https://tailscale.com/blog/more-throughput. It doesn't have the details, but it provides some clues.
Enabling GSO/GRO indeed requires setting some flags. The main difficulty is in handling the larger packets we then receive: effectively we will receive packets larger than the MTU of the interface, and we become responsible for splitting them. This means we also become responsible for populating the headers, most notably the checksum on the sending side. Sending is the easier part.

On the receiving side, we need to reassemble the chunked packets into a larger packet again. This is not strictly necessary, since the chunked packets must be valid packets themselves (after all, we don't know if the target host has GSO enabled). But since packets can be mixed (perhaps we are receiving packets from 2 different sources), we need to meticulously select packets for reassembly by keeping track of the flows we are receiving. All of this is further complicated by the fact that the kernel documentation is pretty lackluster on how to actually implement it.

My idea is to start by creating our own TUN crate, to the point where we can replace the existing ones we use, and then implement this functionality. Note that this is strictly a performance improvement for the sending and receiving node.

Also note that, for performance, we'd want to do the checksum calculations with vector instructions. This should not be too hard: in Rust we can basically write the calculation function multiple times, annotate each copy with different vector extensions so we can rely on LLVM auto-vectorization, and then do conditional function selection in the main function (see the sketch below).
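A rough sketch of that multiversioning idea (hypothetical names, not existing code in this repository), here for the plain ones'-complement sum used by the IP/TCP/UDP checksums:

```rust
/// Scalar body of the RFC 1071 ones'-complement checksum. `#[inline(always)]`
/// matters: when this is inlined into the `#[target_feature]` wrapper below,
/// it is recompiled with that feature enabled, so LLVM can auto-vectorize it.
#[inline(always)]
fn checksum_body(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for chunk in &mut chunks {
        sum += u32::from(u16::from_be_bytes([chunk[0], chunk[1]]));
    }
    if let [last] = chunks.remainder() {
        // An odd trailing byte is treated as the high byte of a final 16-bit word.
        sum += u32::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits and invert.
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// The same body, compiled with AVX2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn checksum_avx2(data: &[u8]) -> u16 {
    checksum_body(data)
}

/// Conditional function selection: pick the widest implementation the CPU supports.
pub fn checksum(data: &[u8]) -> u16 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the feature check above guarantees AVX2 is available.
            return unsafe { checksum_avx2(data) };
        }
    }
    checksum_body(data)
}
```

More wrappers (e.g. other x86 feature levels or NEON) can be stamped out the same way, and the runtime dispatch could be cached in a function pointer instead of re-checking features on every call.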
The things we observe in that issue are, imo, not related to the performance of sending/receiving packets. In local tests (2 nodes on the same machine) I can push 2.5 Gbit (iirc) on my laptop, and 4 Gbit at home.
Actually it is not necessarily about solving that issue. |
Performance profiles show that the largest share of time is currently spent in 3 places:
One option which could improve the situation is #102, since larger packets naturally mean fewer syscalls (in the case of a TCP stream in the overlay). The problem there is that larger packets will need to be fragmented if the lower-layer link has a smaller MTU (which is the reason the MTU is currently set at 1400). While we currently only use stream-based connections, keeping individual packets at MTU 1400 leaves the door open for plain UDP at some point.
The proper way to handle this instead would be to enable TSO (and USO and GRO, while we are at it). Unfortunately, not a lot of info is readily available about this. In a first stage, we'll limit this to Linux. From what I did manage to find so far:
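For concreteness, a minimal sketch (assuming the standard `<linux/if_tun.h>` interface and the `libc` crate; not code from this repository) of how a TUN device is typically opened with the vnet header and offload flags requested:

```rust
use std::fs::OpenOptions;
use std::os::fd::AsRawFd;

// ioctl request numbers and flags copied from <linux/if_tun.h> and <linux/if.h>.
const TUNSETIFF: libc::c_ulong = 0x4004_54ca;     // _IOW('T', 202, int)
const TUNSETOFFLOAD: libc::c_ulong = 0x4004_54d0; // _IOW('T', 208, unsigned int)

const IFF_TUN: libc::c_short = 0x0001;
const IFF_NO_PI: libc::c_short = 0x1000;
const IFF_VNET_HDR: libc::c_short = 0x4000; // prepend a virtio_net_hdr to every packet

const TUN_F_CSUM: libc::c_uint = 0x01; // we can handle packets with partial checksums
const TUN_F_TSO4: libc::c_uint = 0x02; // we can handle oversized IPv4 TCP packets
const TUN_F_TSO6: libc::c_uint = 0x04; // we can handle oversized IPv6 TCP packets
const TUN_F_USO4: libc::c_uint = 0x20; // UDP segmentation offload, IPv4 (recent kernels)
const TUN_F_USO6: libc::c_uint = 0x40; // UDP segmentation offload, IPv6 (recent kernels)

/// Minimal hand-rolled ifreq: interface name plus the flags member of the union.
#[repr(C)]
struct IfReqFlags {
    ifr_name: [u8; 16], // IFNAMSIZ
    ifr_flags: libc::c_short,
    _pad: [u8; 22], // remainder of the 24-byte ifr_ifru union
}

fn open_tun_with_offloads(name: &str) -> std::io::Result<std::fs::File> {
    debug_assert!(name.len() < 16, "interface name must fit in IFNAMSIZ");
    let tun = OpenOptions::new().read(true).write(true).open("/dev/net/tun")?;

    let mut req = IfReqFlags {
        ifr_name: [0; 16],
        ifr_flags: IFF_TUN | IFF_NO_PI | IFF_VNET_HDR,
        _pad: [0; 22],
    };
    req.ifr_name[..name.len()].copy_from_slice(name.as_bytes());

    // Create/attach the interface with the vnet header enabled.
    if unsafe { libc::ioctl(tun.as_raw_fd(), TUNSETIFF, &mut req as *mut IfReqFlags) } < 0 {
        return Err(std::io::Error::last_os_error());
    }

    // Advertise which offloads *we* can handle; the kernel may then hand us
    // packets larger than the MTU, described by the prepended virtio_net_hdr,
    // and expects the same treatment on writes.
    let offloads = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_USO4 | TUN_F_USO6;
    if unsafe { libc::ioctl(tun.as_raw_fd(), TUNSETOFFLOAD, offloads as libc::c_ulong) } < 0 {
        return Err(std::io::Error::last_os_error());
    }

    Ok(tun)
}
```

With IFF_VNET_HDR set, every packet read from or written to the fd is prefixed with a `virtio_net_hdr` describing its checksum and segmentation state, which is exactly the extra bookkeeping (splitting, checksum population, flow-aware reassembly) described in the comment above.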