
Refactor sessions to use socket pool #815

Merged · merged 8 commits into main on Oct 16, 2023
Conversation

@XAMPPRocky (Collaborator)

This PR refactors how we handle upstream connections to the gameservers. When profiling Quilkin I noticed that a lot of time (~10–15%) was being spent dropping the upstream socket through its Arc implementation whenever a session was dropped.

As I was thinking about how to solve this problem, I realised there was a second issue: there is a limit on how many connections Quilkin can hold at once, roughly ~16,383, because beyond that we're likely to start encountering port exhaustion from the operating system, since each session is a unique socket.

This brought me to the solution in this PR: while we need to give each connection to a gameserver a unique port, we don't need that port to be unique across gameservers. So I refactored how we create sessions to use what I've called a "SessionPool", which pools the sockets for sessions into a map keyed by their destination.

With this implementation, we now have a limit of ~16,000 connections per gameserver, which is far more than any gameserver could reasonably need.
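Roughly, the shape looks something like this (a minimal sketch with illustrative names and simplified logic, not the actual types in this PR):

```rust
use std::{collections::HashMap, net::SocketAddr, sync::Arc};
use tokio::net::UdpSocket;

/// Sketch only: upstream sockets pooled per destination gameserver.
/// Sessions to *different* gameservers can share a local port, so the
/// ephemeral-port limit now applies per gameserver, not globally.
#[derive(Default)]
struct SessionPool {
    sockets: HashMap<SocketAddr, Vec<Arc<UdpSocket>>>,
}

impl SessionPool {
    /// Reuses a pooled socket for `dest` if one exists, otherwise binds
    /// a fresh one. (The real logic also tracks which sockets are
    /// reserved per client; this only shows the keying by destination.)
    async fn socket_for(&mut self, dest: SocketAddr) -> std::io::Result<Arc<UdpSocket>> {
        let entry = self.sockets.entry(dest).or_default();
        if let Some(socket) = entry.first() {
            return Ok(Arc::clone(socket));
        }
        let socket = Arc::new(UdpSocket::bind("0.0.0.0:0").await?);
        entry.push(Arc::clone(&socket));
        Ok(socket)
    }
}
```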

Future Work

Add a limit on connections per gameserver; if, for example, you know the most you're going to have is ~50 players, setting it to something like 75 seems reasonable. This would help prevent a large-scale DDoS from causing port exhaustion (though it would still flood the existing sockets, so it wouldn't prevent a DDoS).
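A minimal sketch of what that cap could look like (hypothetical constant and function; nothing like this exists yet):

```rust
/// Hypothetical per-gameserver session cap, e.g. ~50 expected players
/// plus headroom. Not an existing Quilkin option.
const MAX_SESSIONS_PER_GAMESERVER: usize = 75;

/// Returns whether a new session to a gameserver may be created,
/// given how many sessions are already active for it.
fn allow_new_session(active_sessions: usize) -> bool {
    active_sessions < MAX_SESSIONS_PER_GAMESERVER
}
```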

@XAMPPRocky (Collaborator, Author)

@markmandel I'm seeing "key not found" and "permission denied" errors on the CI run, so I can't see what's failing.

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Details>
No such object: quilkin-build-logs/log-efda5b5a-d699-4bd6-a652-b673102a1c24.txt
</Details>
</Error>

@markmandel (Member)

You can also have a look at the GitHub check: https://github.com/googleforgames/quilkin/pull/815/checks?check_run_id=17577335488. The log is also there 😄 (I turned it on a while ago).

@markmandel (Member) commented Oct 10, 2023

You should also be able to access https://console.cloud.google.com/cloud-build/builds/efda5b5a-d699-4bd6-a652-b673102a1c24?project=328742829241 if your email is part of the Google Group, but it does look like the link just to the log is broken! Thanks for the heads up, I'll get that fixed.

@XAMPPRocky (Collaborator, Author)

@markmandel Great, thank you. I would have tested locally, but Quilkin has been broken on macOS since the dual-stack change, and I keep forgetting to file an issue about it (doing so now). So CI is my main way to test network integration now.

@markmandel (Member) commented Oct 10, 2023

Oh that's really clever!

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Nice!

@github-actions

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.


@XAMPPRocky (Collaborator, Author) commented Oct 12, 2023

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Yes, in fact, they're guaranteed to overlap (or mux), as we only allocate new upstream sockets when all sockets for a server are reserved.
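A sketch of that rule (illustrative only, not the code in this PR): a pooled port is reusable for a destination as long as no existing session to that destination occupies it, and we only bind a new socket when none is free.

```rust
use std::{
    collections::{HashMap, HashSet},
    net::SocketAddr,
};

/// Illustrative: `reserved` maps each gameserver to the local ports
/// already carrying a session to it. A pooled port is free for `dest`
/// as long as no existing session to `dest` occupies it, which is why
/// ports are guaranteed to be shared across gameservers.
fn pick_port(
    reserved: &HashMap<SocketAddr, HashSet<u16>>,
    pooled_ports: &[u16],
    dest: SocketAddr,
) -> Option<u16> {
    let used = reserved.get(&dest);
    pooled_ports
        .iter()
        .copied()
        .find(|port| used.map_or(true, |set| !set.contains(port)))
}
```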



@markmandel (Member)

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Yes, in fact, they're guaranteed to overlap (or mux), as we only allocate new upstream sockets when all sockets for a server are reserved.

Makes a lot of sense. Nice optimisation!


@XAMPPRocky (Collaborator, Author)

@markmandel CI seems stuck.

@markmandel (Member)

🤔 That's a 2h timeout, with 01:59:10 spent in test-quilkin. Looks like a test is getting stuck somewhere and not finishing.

Does cargo test complete for you locally?

@XAMPPRocky (Collaborator, Author)

Yeah, the test it was stuck on was the benchmark, and that's one I didn't have a problem with.

I had to update a lot of tests, because previously we were storing 0.0.0.0 in the configuration, but traffic never comes from that address, so I had to map it to localhost.
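For reference, the kind of mapping the tests now do, as a hypothetical helper (not the actual test code):

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical test helper: packets never originate *from* the
/// unspecified address, so a configured 0.0.0.0 is rewritten to
/// loopback before the tests exercise it.
fn map_unspecified_to_localhost(addr: SocketAddr) -> SocketAddr {
    if addr.ip().is_unspecified() {
        SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), addr.port())
    } else {
        addr
    }
}
```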

@markmandel (Member) commented Oct 13, 2023

I had to update a lot of tests, because previously we were storing 0.0.0.0 in the configuration, but traffic never comes from that address, so I had to map it to localhost.

Oh, that's fun, but it also makes sense. Previously we didn't care what the source address was: if the mapped port got the traffic, we knew where it was going.

In this world, the source address really matters, since we need it to know where to forward packets going from the endpoint back to the client.
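In sketch form (illustrative types only, not the actual session map), the return path becomes a lookup keyed by the packet's source:

```rust
use std::{collections::HashMap, net::SocketAddr};

/// Illustrative demux table: a datagram arriving on upstream local
/// port `port` from gameserver `from` belongs to exactly one client,
/// which is why the source address now matters.
struct Demux {
    clients: HashMap<(SocketAddr, u16), SocketAddr>,
}

impl Demux {
    fn client_for(&self, from: SocketAddr, port: u16) -> Option<SocketAddr> {
        self.clients.get(&(from, port)).copied()
    }
}
```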

Yeah, the test it was stuck on was the benchmark, and that's one I didn't have a problem with.

It also makes me realise we should have timeouts that fail our benchmarks, rather than letting them block forever.

I note that throughput_benchmark uses FEEDBACK_LOOP (just looking for differences from readwrite_benchmark, which is working), but everything is using 127.0.0.1 as its root address, so it should map correctly, I would have thought.

Nothing else immediately stands out (assuming some weird IPv4/IPv6 mapping issue like you were describing), but I can try running it locally on my Linux machine and see if I can replicate it.

@markmandel (Member)

I'm running tests locally, and I'm getting quite inconsistent failures in unit tests on each run… which makes me feel like there's a race condition somewhere.

Here's a few different runs:

...
test cli::proxy::tests::run_client ... FAILED
...
test proxy::sessions::tests::same_address_uses_different_sockets has been running for over 60 seconds
test load_balancer_filter ... FAILED

failures:

---- load_balancer_filter stdout ----
thread 'load_balancer_filter' panicked at tests/load_balancer.rs:77:14:
called `Result::unwrap()` on an `Err` value: Elapsed(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
test xds::tests::token_routing ... FAILED
thread 'xds::tests::token_routing' panicked at src/xds.rs:284:14:

My guess on the benchmark, then, is that it's falling into this same race condition, where I expect data is being dropped or not routed correctly.


@XAMPPRocky (Collaborator, Author) commented Oct 15, 2023

which makes me feel like there's a race condition somewhere

I finally figured it out. I was scouring the code and using debuggers, and I couldn't find anything in the Rust code that was locking or blocking, so I was at wit's end when I decided to just rewrite the module from scratch. That's when I realised that I had allocated an extra socket on the same port for the upstream sockets to send to, but it had no recv loop. So what was happening was that Linux was assigning data to that socket, and it was never being read.

The fix was to share that initial socket for the upstream with the first downstream worker.
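For anyone hitting the same thing: with SO_REUSEPORT the kernel spreads incoming datagrams across all sockets bound to a port, so a socket that's bound but never read silently swallows whatever flows hash to it. A minimal sketch of the pitfall, assuming the socket2 crate with its `all` feature (Quilkin's actual socket setup differs):

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::SocketAddr;

/// Binds a UDP socket with SO_REUSEPORT set. Bind this twice on the
/// same port and only read from one of the sockets, and any flow the
/// kernel hashes to the unread socket is never received.
fn bind_reuse(addr: SocketAddr) -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?;
    socket.bind(&addr.into())?;
    Ok(socket)
}
```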


@koslib (Collaborator) left a comment:

lgtm


@quilkin-bot (Collaborator)

Build Succeeded 🥳

Build Id: 7d0fb86b-0d0a-4126-b331-ea616662de7c

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch [email protected]:googleforgames/quilkin.git pull/815/head:pr_815 && git checkout pr_815
cargo build

@XAMPPRocky merged commit 439df95 into main on Oct 16, 2023
3 checks passed
@markmandel deleted the ep/socket-pool branch on October 16, 2023 at 19:55
@markmandel (Member)

I finally figured it out. I was scouring the code and using debuggers, and I couldn't find anything in the Rust code that was locking or blocking, so I was at wit's end when I decided to just rewrite the module from scratch. That's when I realised that I had allocated an extra socket on the same port for the upstream sockets to send to, but it had no recv loop. So what was happening was that Linux was assigning data to that socket, and it was never being read.

The fix was to share that initial socket for the upstream with the first downstream worker.

Glad you found it! Nice job!

If it's any consolation, I did exactly the same thing when doing the socket reuse_port work, and it took me a week (more?) to work it out 🤦🏻‍♂️

Labels: kind/feature, size/xl