
Refactor sessions to use socket pool #815

Merged · merged 8 commits into main on Oct 16, 2023
Conversation

@XAMPPRocky (Collaborator)

This PR refactors how we handle upstream connections to the gameservers. When profiling Quilkin I noticed that a lot of time (~10–15%) was being spent dropping the upstream socket through its Arc implementation whenever a session was dropped.

As I was thinking about how to solve this problem, I realised there was a second issue: there is a limit on how many connections Quilkin can hold at once, roughly ~16,383, because beyond that we're likely to start encountering port exhaustion from the operating system, since each session is a unique socket.

This brought me to the solution in this PR: while we need to give each connection to a gameserver a unique port, we don't need that port to be unique across gameservers. So I refactored how we create sessions to use what I've called a "SessionPool", which pools the sockets for sessions into a map keyed by their destination.

With this implementation, we now have a limit of ~16,000 connections per gameserver, which is far more than any gameserver could reasonably need.
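Roughly, the shape looks something like this (a minimal sketch with illustrative names and simplified logic, not the actual types in this PR):

```rust
use std::{collections::HashMap, net::SocketAddr, sync::Arc};
use tokio::net::UdpSocket;

/// Sketch only: upstream sockets pooled per destination gameserver.
/// Sessions to *different* gameservers can share a local port, so the
/// ephemeral-port limit now applies per gameserver, not globally.
#[derive(Default)]
struct SessionPool {
    sockets: HashMap<SocketAddr, Vec<Arc<UdpSocket>>>,
}

impl SessionPool {
    /// Reuses a pooled socket for `dest` if one exists, otherwise binds
    /// a fresh one. (The real logic also tracks which sockets are
    /// reserved per client; this only shows the keying by destination.)
    async fn socket_for(&mut self, dest: SocketAddr) -> std::io::Result<Arc<UdpSocket>> {
        let entry = self.sockets.entry(dest).or_default();
        if let Some(socket) = entry.first() {
            return Ok(Arc::clone(socket));
        }
        let socket = Arc::new(UdpSocket::bind("0.0.0.0:0").await?);
        entry.push(Arc::clone(&socket));
        Ok(socket)
    }
}
```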

Future Work

Add a limit on connections per gameserver; if, for example, you know the most you're going to have is ~50 players, setting it to something like 75 seems reasonable. This would help prevent a large-scale DDoS from causing port exhaustion (though it would still flood the existing sockets, so it wouldn't prevent a DDoS).
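A minimal sketch of what that cap could look like (hypothetical constant and function; nothing like this exists yet):

```rust
/// Hypothetical per-gameserver session cap, e.g. ~50 expected players
/// plus headroom. Not an existing Quilkin option.
const MAX_SESSIONS_PER_GAMESERVER: usize = 75;

/// Returns whether a new session to a gameserver may be created,
/// given how many sessions are already active for it.
fn allow_new_session(active_sessions: usize) -> bool {
    active_sessions < MAX_SESSIONS_PER_GAMESERVER
}
```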

@XAMPPRocky (Collaborator, Author)

@markmandel I'm seeing "key not found" and "permission denied" errors on the CI run, so I can't see what's failing.

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Details>
No such object: quilkin-build-logs/log-efda5b5a-d699-4bd6-a652-b673102a1c24.txt
</Details>
</Error>

@markmandel (Member)

You can also have a look at the GitHub check: https://github.com/googleforgames/quilkin/pull/815/checks?check_run_id=17577335488. The log is also there 😄 (I turned it on a while ago).

@markmandel (Member) commented Oct 10, 2023

You should also be able to access https://console.cloud.google.com/cloud-build/builds/efda5b5a-d699-4bd6-a652-b673102a1c24?project=328742829241 if your email is part of the Google Group, but it does look like the link just to the log is broken! Thanks for the heads up, I'll get that fixed.

@XAMPPRocky (Collaborator, Author)

@markmandel Great, thank you. I would have tested locally, but Quilkin has been broken on macOS since the dual-stack change, and I keep forgetting to file an issue about it (doing so now). So CI is my main way to test network integration now.

@markmandel (Member) commented Oct 10, 2023

Oh that's really clever!

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Nice!

@github-actions

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.


@XAMPPRocky (Collaborator, Author) commented Oct 12, 2023

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Yes, in fact, they're guaranteed to overlap (or mux), as we only allocate new upstream sockets when all sockets for a server are reserved.
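A sketch of that rule (illustrative only, not the code in this PR): a pooled port is reusable for a destination as long as no existing session to that destination occupies it, and we only bind a new socket when none is free.

```rust
use std::{
    collections::{HashMap, HashSet},
    net::SocketAddr,
};

/// Illustrative: `reserved` maps each gameserver to the local ports
/// already carrying a session to it. A pooled port is free for `dest`
/// as long as no existing session to `dest` occupies it, which is why
/// ports are guaranteed to be shared across gameservers.
fn pick_port(
    reserved: &HashMap<SocketAddr, HashSet<u16>>,
    pooled_ports: &[u16],
    dest: SocketAddr,
) -> Option<u16> {
    let used = reserved.get(&dest);
    pooled_ports
        .iter()
        .copied()
        .find(|port| used.map_or(true, |set| !set.contains(port)))
}
```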



@markmandel (Member)

So if we still have two clients talking to GameServer1, each client from the proxy will have its own local port, but those local ports may overlap with the local ports that two other clients are using to talk to GameServer2.

Yes, in fact, they're guaranteed to overlap (or mux), as we only allocate new upstream sockets when all sockets for a server are reserved.

Makes a lot of sense. Nice optimisation!


@XAMPPRocky (Collaborator, Author)

@markmandel CI seems stuck.

@markmandel (Member)

🤔 That's a 2h timeout, with 01:59:10 spent in test-quilkin. Looks like a test is getting stuck somewhere and not finishing.

Does cargo test complete for you locally?

@XAMPPRocky (Collaborator, Author)

Yeah, the test it was stuck on was the benchmark, and that's one I didn't have a problem with.

I had to update a lot of tests, because previously we were storing 0.0.0.0 in the configuration, but traffic never comes from that address, so I had to map it to localhost.
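For reference, the kind of mapping the tests now do, as a hypothetical helper (not the actual test code):

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical test helper: packets never originate *from* the
/// unspecified address, so a configured 0.0.0.0 is rewritten to
/// loopback before the tests exercise it.
fn map_unspecified_to_localhost(addr: SocketAddr) -> SocketAddr {
    if addr.ip().is_unspecified() {
        SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), addr.port())
    } else {
        addr
    }
}
```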

@markmandel (Member) commented Oct 13, 2023

I had to update a lot of tests, because previously we were storing 0.0.0.0 in the configuration, but traffic never comes from that address, so I had to map it to localhost.

Oh, that's fun, but it also makes sense. Previously we didn't care what the source address was: if the mapped port got the traffic, we knew where it was going.

In this world, the source address really matters, since we need it to know where to forward packets going from the endpoint back to the client.
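In sketch form (illustrative types only, not the actual session map), the return path becomes a lookup keyed by the packet's source:

```rust
use std::{collections::HashMap, net::SocketAddr};

/// Illustrative demux table: a datagram arriving on upstream local
/// port `port` from gameserver `from` belongs to exactly one client,
/// which is why the source address now matters.
struct Demux {
    clients: HashMap<(SocketAddr, u16), SocketAddr>,
}

impl Demux {
    fn client_for(&self, from: SocketAddr, port: u16) -> Option<SocketAddr> {
        self.clients.get(&(from, port)).copied()
    }
}
```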

Yeah, the test it was stuck on was the benchmark, and that's one I didn't have a problem with.

It also makes me realise we should have timeouts that fail our benchmarks, rather than letting them block forever.

I note that throughput_benchmark uses FEEDBACK_LOOP (just looking for differences from readwrite_benchmark, which is working), but everything is using 127.0.0.1 as its root address, so it should map correctly, I would have thought.

Nothing else immediately stands out (assuming some weird IPv4/IPv6 mapping issue like you were describing), but I can try running it locally on my Linux machine and see if I can replicate it.

@markmandel (Member)

I'm running tests locally, and I'm getting quite inconsistent failures in unit tests on each run… which makes me feel like there's a race condition somewhere.

Here's a few different runs:

...
test cli::proxy::tests::run_client ... FAILED
...
test proxy::sessions::tests::same_address_uses_different_sockets has been running for over 60 seconds
test load_balancer_filter ... FAILED

failures:

---- load_balancer_filter stdout ----
thread 'load_balancer_filter' panicked at tests/load_balancer.rs:77:14:
called `Result::unwrap()` on an `Err` value: Elapsed(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
test xds::tests::token_routing ... FAILED
thread 'xds::tests::token_routing' panicked at src/xds.rs:284:14:

My guess on the benchmark, then, is that it's falling into this same race condition, where I expect data is being dropped or not routed correctly.


@XAMPPRocky (Collaborator, Author) commented Oct 15, 2023

which makes me feel like there's a race condition somewhere

I finally figured it out. I was scouring the code and using debuggers, and I couldn't find anything in the Rust code that was locking or blocking, so I was at wit's end when I decided to just rewrite the module from scratch. That's when I realised that I had allocated an extra socket on the same port for the upstream sockets to send to, but it had no recv loop. So what was happening was that Linux was assigning data to that socket, and it was never being read.

The fix was to share that initial socket for the upstream with the first downstream worker.
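For anyone hitting the same thing: with SO_REUSEPORT the kernel spreads incoming datagrams across all sockets bound to a port, so a socket that's bound but never read silently swallows whatever flows hash to it. A minimal sketch of the pitfall, assuming the socket2 crate with its `all` feature (Quilkin's actual socket setup differs):

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::net::SocketAddr;

/// Binds a UDP socket with SO_REUSEPORT set. Bind this twice on the
/// same port and only read from one of the sockets, and any flow the
/// kernel hashes to the unread socket is never received.
fn bind_reuse(addr: SocketAddr) -> std::io::Result<Socket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?;
    socket.bind(&addr.into())?;
    Ok(socket)
}
```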


@koslib (Collaborator) left a comment:

lgtm


@quilkin-bot (Collaborator)

Build Succeeded 🥳

Build Id: 7d0fb86b-0d0a-4126-b331-ea616662de7c

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch [email protected]:googleforgames/quilkin.git pull/815/head:pr_815 && git checkout pr_815
cargo build

@XAMPPRocky merged commit 439df95 into main on Oct 16, 2023
3 checks passed
@markmandel deleted the ep/socket-pool branch on October 16, 2023 at 19:55
@markmandel (Member)

I finally figured it out. I was scouring the code and using debuggers, and I couldn't find anything in the Rust code that was locking or blocking, so I was at wit's end when I decided to just rewrite the module from scratch. That's when I realised that I had allocated an extra socket on the same port for the upstream sockets to send to, but it had no recv loop. So what was happening was that Linux was assigning data to that socket, and it was never being read.

The fix was to share that initial socket for the upstream with the first downstream worker.

Glad you found it! Nice job!

If it's any consolation, I did exactly the same thing when doing the socket reuse_port work, and it took me a week (more?) to work it out 🤦🏻‍♂️

Labels: kind/feature, size/xl