Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporarily retain endpoints which are in active use #777

Closed
wants to merge 31 commits into from

Conversation

XAMPPRocky
Copy link
Collaborator

I don't think should be the actual solution, but I needed to test it, and figure we can also use the PR to discuss how to solve the problem in a better way.

Consider the following infrastructure

proxy -> management server
proxy -> relay <- agent

If the connection to the relay or management server, or the agent is disrupted for some reason, and those services need to be restarted, there's a chance that the proxy or relay will receive a response before the agent or management server has received information from k8s, which would lead to the proxy clearing its all of its endpoints, which will lead to player's traffic being dropped for that period of time.

This PR changes that behaviour by adding a shared atomic counter to endpoints, and when it merges new configuration will retain the old endpoints that are currently in use. It will still replace their metadata so if we receive the same endpoint, but we'll keep any in-use endpoints for their session duration.

@XAMPPRocky XAMPPRocky force-pushed the ep/retain-endpoints branch 2 times, most recently from 071b23d to 2af4534 Compare August 23, 2023 07:38
@markmandel
Copy link
Contributor

Hmnn. That's super interesting.

I actually wonder if we solve this in the same way we solve proxies not being available behind a loadbalancer if they have no endpoints.

quilkin/src/admin.rs

Lines 98 to 106 in 1fc3954

fn check_proxy_readiness(config: &Config) -> Response<Body> {
if config.clusters.read().endpoints().count() > 0 {
return Response::new("ok".into());
}
let mut response = Response::new(Body::empty());
*response.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
response
}

Which is to say a relay, agent, or management server can't return OK from /ready unless it has some endpoints. So if the proxy does restart for some reason, it (in theory) won't get added back into the LB until it has some sort of configuration.

The downside is - if there are legitimately no Endpoints, then that can't be communicated to the proxies 🤔 so maybe that's not the best approach.

Unless there is some way on the relay / agent / etc layer to not return true from /ready until they can say "hey, I've gone and checked at least once from the config I need to check from" and therefore won't send data back up to the proxy?

Did that make sense? Basically I think we should manage this in the '/ready handler since that's what will determine if it's endpoint (and it's configuration) is available to the outside world.

// This will mean that updated metadata such as new tokens
// will be respected, but we will still retain older
// endpoints that are currently actively used in a session.
Entry::Occupied(entry) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this dangerous? Basically if an endpoint is getting a malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect the internal resources -- doesn't this mean that that won't work while the malicious traffic is continuing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically if an endpoint is getting a malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect the internal resources -- doesn't this mean that that won't work while the malicious traffic is continuing?

Well yeah, that is why we shouldn't accept this version. I'll write up my response to your other questions/proposed solution in a separate comment, but I think what we need is to be able to have different verbs here. We need to be able to distinct between "services are disrupted or a bug has caused a endpoint to be lost in the chain" where we want the load balancer to retain that endpoint because we don't want player's to be dropped mid-game, and "stop sending traffic to this endpoint because we want to stop traffic to it".

@XAMPPRocky XAMPPRocky force-pushed the ep/retain-endpoints branch 2 times, most recently from a0d6458 to 8c2fc7d Compare August 25, 2023 12:16
@XAMPPRocky
Copy link
Collaborator Author

XAMPPRocky commented Aug 25, 2023

Did that make sense? Basically I think we should manage this in the '/ready handler since that's what will determine if it's endpoint (and it's configuration) is available to the outside world.

So I don't like that approach for couple reasons.

  • gRPC Connection won't initiate until /ready has completed. This does mean that there will be additional latency or spurious unresolvable connection failures between services. When games are allocated we should want to have that information distributed as quickly as possible.
  • Works for management servers, not for relay. Even though it won't initiate until it has servers, in a multi-cluster setup, what we're more likely to run into is that one cluster goes down, and the load balancer won't be aware of that because it's getting its information from the relay rather than directly from the management server, it will just see an incomplete config.

I think what we need is new verbs for communicating changes to load balancers, so that's actually clear when we intentionally want to delete or change endpoints, and when there's potentially a service disruption, misconfiguration, or a bug, and have the load balancers handle those differently, as I think absent an intentionality (from removing a token, using a firewall, etc). The load balancer should strive to allow player's to continue playing to the end of their session.

This could be having "Delta xDS" replace our current xDS protocol, so that we're communicating changes, and as added benefit we're reducing the amount of traffic we're communicating as we scale.

@markmandel
Copy link
Contributor

This could be having "Delta xDS" replace our current xDS protocol, so that we're communicating changes, and as added benefit we're reducing the amount of traffic we're communicating as we scale.

Ultimately that's probably the better path, I think you are right.

If a proxy hasn't receive the request to remove something, then it will just keep it. If the relay server crashes, and a agent behind that also happened to crash at the same time - the relay can't tell any of the proxies to delete something that it wasn't aware of in the first place.

On the management server / agent perspective - I would say it still make sense to have a /ready that only returns true one it has checked and parsed the initial state of whatever world it's looking at. That way at least it can't send out an empty config before it actually has information. That at least solves issues at that layer.

gRPC Connection won't initiate until /ready has completed. This does mean that there will be additional latency or spurious unresolvable connection failures between services. When games are allocated we should want to have that information distributed as quickly as possible.

At the agent / management layer, I don't think this is a concern, because fast connections to systems that have bad data don't help anyone. We could block on the gRPC request until the initial check completes, but at that point we're introducing the same delay anyway? Or maybe a check at both layers is good, just to be extra sure. I'd lead more towards optimising more on correct information than speed of delivery for this type of scenario.

@XAMPPRocky
Copy link
Collaborator Author

We could block on the gRPC request until the initial check completes, but at that point we're introducing the same delay anyway?

Well actually I do like that better, because that delay is only on initialisation right? It wouldn't be whenever there's no endpoints. Even though I think you're right from the agent / management server perspective that it mostly doesn't matter, and what matters is that it's returning data. From the load balancer side, it's good to see stuff like "warning cannot connect to " whenever it has zero endpoints, but having the check be until it receives the first update from k8s does prevent that from happening outside of startup.

@XAMPPRocky XAMPPRocky force-pushed the ep/retain-endpoints branch 2 times, most recently from a1d7b28 to 3dce130 Compare September 21, 2023 08:50
@XAMPPRocky XAMPPRocky requested a review from koslib as a code owner October 15, 2023 18:18
@XAMPPRocky XAMPPRocky enabled auto-merge (squash) October 18, 2023 08:41
@XAMPPRocky XAMPPRocky disabled auto-merge October 18, 2023 08:41
@XAMPPRocky XAMPPRocky force-pushed the ep/retain-endpoints branch 7 times, most recently from 7730eeb to 5692a55 Compare October 22, 2023 11:24
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@quilkin-bot
Copy link
Collaborator

Build Failed 😭

Build Id: f72cf2aa-6398-4db3-b8c2-e48fa8f082fa

Status: FAILURE

To get permission to view the Cloud Build view, join the quilkin-discuss Google Group.

Filter with the Git Commit 072fd80 within us-docker.pkg.dev/quilkin/ci//quilkin to see if a development image is available from this build, and can be used for debugging purposes.

Development images are retained for at least 30 days.

@XAMPPRocky
Copy link
Collaborator Author

Closing as superseded by #850 and #836

@XAMPPRocky XAMPPRocky closed this Nov 15, 2023
@markmandel markmandel deleted the ep/retain-endpoints branch March 11, 2024 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants