Refactor sessions to use socket pool #815
Conversation
@markmandel I'm seeing "key not found" and "permissions denied" errors on the CI run, so I can't see what's failing.
You can also have a look at the GitHub check: https://github.com/googleforgames/quilkin/pull/815/checks?check_run_id=17577335488, the log is also there 😄 (I turned it on a while ago).
You should also be able to access https://console.cloud.google.com/cloud-build/builds/efda5b5a-d699-4bd6-a652-b673102a1c24?project=328742829241 if your email is part of the Google Group, but it does look like the link just to the log is broken! Thanks for the heads up, I'll get that fixed.
@markmandel Great, thank you. I would have tested locally, but Quilkin has been broken on macOS since the dual stack change, and I keep forgetting to make an issue about it (doing so now). So CI is my main way to test network integration now.
Oh that's really clever! So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2. Nice!
Force-pushed from 464fcda to b8320dc
Force-pushed from b8320dc to 52a9e80
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
Force-pushed from 52a9e80 to 3a74489
Force-pushed from 3a74489 to 69c43db
Yes, in fact, they're guaranteed to overlap (or mux), as we only allocate new upstream sockets when all sockets for a server are reserved.
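For illustration, here's a minimal sketch of that allocation rule (hypothetical types; `SessionPoolSketch` and `PooledSocket` are made-up names, not Quilkin's actual structs): sockets live in a map keyed by the gameserver's address, and a new upstream socket is only created once every existing socket for that destination is reserved.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical, simplified model of the pooling rule described above.
struct PooledSocket {
    local_port: u16,
    reserved: bool,
}

#[derive(Default)]
struct SessionPoolSketch {
    by_destination: HashMap<SocketAddr, Vec<PooledSocket>>,
}

impl SessionPoolSketch {
    /// Reserve a socket for `destination`, reusing a free one when possible.
    fn reserve(&mut self, destination: SocketAddr, next_free_port: u16) -> u16 {
        let sockets = self.by_destination.entry(destination).or_default();
        if let Some(free) = sockets.iter_mut().find(|s| !s.reserved) {
            free.reserved = true;
            return free.local_port;
        }
        // All sockets for this gameserver are reserved, so allocate a new one.
        // The same local port can simultaneously be in use for a *different*
        // gameserver, which is exactly the overlap/muxing discussed above.
        sockets.push(PooledSocket {
            local_port: next_free_port,
            reserved: true,
        });
        next_free_port
    }
}
```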
Force-pushed from 69c43db to 952bc3e
Force-pushed from 952bc3e to d77f535
Force-pushed from d77f535 to 96c7ed7
This commit refactors how we handle upstream connections to the gameservers. When profiling Quilkin I noticed that a lot of time (~10–15%) was being spent dropping the upstream socket through its Arc implementation whenever a session was dropped. While thinking about how to solve this, I also realised there was a second issue: there is a limit on how many connections Quilkin can hold at once, roughly ~16,383, because beyond that we're likely to start encountering port exhaustion from the operating system, since each session is a unique socket. This brought me to the solution in this commit: while we need to give each connection to a gameserver a unique port, the port doesn't need to be unique across gameservers. So I refactored how we create sessions to use what I've called a "SessionPool", which pools the sockets for sessions into a map keyed by their destination. With this implementation we now have a limit of ~16,000 connections per gameserver, which is far more than any gameserver could reasonably need.
Force-pushed from 96c7ed7 to 96eacaa
Makes a lot of sense. Nice optimisation!
Force-pushed from e38356e to 8d2daef
@markmandel CI seems stuck.
🤔 that's a 2h timeout. Does the test it's stuck on pass for you locally?
Yeah, the test it was stuck on was the benchmark, and that's one that I didn't have a problem with. I had to update a lot of tests because previously we were storing sessions by their mapped port, and now they're keyed by the source address.
Oh that's fun - but also makes sense. Previously we didn't care what the source address was - if the mapped port got the traffic, we knew where it was going. In this world, the source address really matters, since we need it to know where to send packets going from the endpoint back to the client.
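For readers following along, a hedged sketch of what that keying change implies (hypothetical type, not the actual Quilkin struct): with pooled upstream sockets there is no longer a dedicated local port per client, so a session has to be identified by the full source/destination pair.

```rust
use std::net::SocketAddr;

// Hypothetical sketch: once upstream sockets are shared per destination,
// a session is looked up by where the packet came from as well as where
// it is going, rather than by a unique local port.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SessionKey {
    /// The downstream client the packet originated from.
    source: SocketAddr,
    /// The gameserver the packet is being proxied to.
    destination: SocketAddr,
}
```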
Also makes me realise we should have some timeouts with failures in our benchmarks, rather than just blocking forever. Nothing immediate otherwise stands out (assuming some weird ipv4/ipv6 mapping issue like you were stating), but I can try running it locally on my Linux machine and see if I can replicate.
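On the benchmark-timeout point, a minimal sketch of the kind of guard that could be added, assuming a tokio-based benchmark loop (`recv_or_fail` and the 5-second budget are made up for illustration):

```rust
use std::time::Duration;
use tokio::net::UdpSocket;
use tokio::time::timeout;

// Hypothetical sketch: wrap the receive the benchmark waits on in a timeout,
// so a dropped or misrouted packet fails the run instead of hanging it.
async fn recv_or_fail(socket: &UdpSocket, buf: &mut [u8]) -> usize {
    match timeout(Duration::from_secs(5), socket.recv_from(buf)).await {
        Ok(Ok((len, _addr))) => len,
        Ok(Err(err)) => panic!("recv failed: {err}"),
        Err(_) => panic!("benchmark timed out waiting for a packet"),
    }
}
```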
I'm running tests locally, and I'm getting quite inconsistent failures in unit tests on each run, which makes me feel like there's a race condition somewhere. Here are a few different runs:
My guess then on the benchmark is that it's falling into this same race condition, where I expect data is being dropped/not routed correctly.
Force-pushed from 8d2daef to 71afddf
Force-pushed from 71afddf to 97e2f04
I finally figured it out. I was scouring the code and using debuggers, and I couldn't find anything in the Rust code that was locking or blocking. I was at wits' end when I decided to just rework the module and port it into the code again. That's when I realised that I had allocated an extra socket on the same port for the upstream sockets to send to, but it had no recv loop. So what was happening was that Linux was assigning packets to that socket, and they were never being read. The fix was to share that initial upstream socket with the first downstream worker.
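A hedged sketch of the shape of that fix (assumed details, not the real Quilkin code): share one socket between the first downstream worker's recv loop and the upstream send path, so every socket bound to the port actually has something draining it.

```rust
use std::sync::Arc;
use tokio::net::UdpSocket;

// Hypothetical sketch of the fix: one socket, shared between the first
// downstream worker's recv loop and the upstream send path, instead of a
// second socket on the same port that nothing ever reads from.
async fn spawn_first_worker(addr: &str) -> std::io::Result<Arc<UdpSocket>> {
    let socket = Arc::new(UdpSocket::bind(addr).await?);

    // The recv loop owns a clone of the Arc...
    let recv_socket = socket.clone();
    tokio::spawn(async move {
        let mut buf = [0u8; 2048];
        loop {
            if let Ok((len, from)) = recv_socket.recv_from(&mut buf).await {
                // ...and would route the packet to the right session here.
                let _ = (len, from);
            }
        }
    });

    // The same Arc is handed back so upstream sends reuse this socket,
    // guaranteeing every packet the OS delivers to this port gets read.
    Ok(socket)
}
```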
Force-pushed from 97e2f04 to d15251f
lgtm
Build Succeeded 🥳
Build Id: 7d0fb86b-0d0a-4126-b331-ea616662de7c
The following development images have been built, and will exist for the next 30 days. To build this version:
Glad you found it! Nice job! If it's any consolation I did exactly the same thing when doing the Socket |
This PR refactors how we handle upstream connections to the gameservers. When profiling Quilkin I noticed that a lot of time (~10–15%) was being spent dropping the upstream socket through its Arc implementation whenever a session was dropped.
As I was thinking about how to solve this problem I also realised there was a second issue: there is a limit on how many connections Quilkin can hold at once, roughly ~16,383, because beyond that we're likely to start encountering port exhaustion from the operating system, since each session is a unique socket.
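(For reference, if that figure comes from the IANA dynamic/ephemeral port range 49152–65535, that's 65535 - 49152 + 1 = 16,384 ports to draw from, which roughly matches; the exact ceiling depends on the OS's configured ephemeral port range.)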
This brought me to the solution in this PR: while we need to give each connection to a gameserver a unique port, the port doesn't need to be unique across gameservers. So I refactored how we create sessions to use what I've called a "SessionPool", which pools the sockets for sessions into a map keyed by their destination.
With this implementation we now have a limit of ~16,000 connections per gameserver, which is far more than any gameserver could reasonably need.
Future Work
Add a limit on the number of connections per gameserver; if, for example, you know the most you're going to have is ~50 players, setting it to something like 75 seems reasonable. This would help prevent a large-scale DDoS from causing port exhaustion (though it would still flood the existing sockets, so it wouldn't prevent a DDoS outright). See the sketch below.
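A minimal sketch of how such a cap might look (hypothetical names and field, not an existing Quilkin option), continuing the pool sketch from earlier in the thread:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical per-gameserver cap: once a destination already has
// `max_sockets_per_gameserver` sockets, new allocations are refused
// instead of risking port exhaustion.
struct CappedPool {
    max_sockets_per_gameserver: usize,
    sockets_per_destination: HashMap<SocketAddr, usize>,
}

impl CappedPool {
    fn try_allocate(&mut self, destination: SocketAddr) -> Result<(), String> {
        let count = self.sockets_per_destination.entry(destination).or_insert(0);
        if *count >= self.max_sockets_per_gameserver {
            return Err(format!(
                "session limit of {} reached for {destination}",
                self.max_sockets_per_gameserver
            ));
        }
        *count += 1;
        Ok(())
    }
}
```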