-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Temporarily retain endpoints which are in active use #777
Conversation
071b23d
to
2af4534
Compare
Hmnn. That's super interesting. I actually wonder if we solve this in the same way we solve proxies not being available behind a loadbalancer if they have no endpoints. Lines 98 to 106 in 1fc3954
Which is to say a relay, agent, or management server can't return The downside is - if there are legitimately no Endpoints, then that can't be communicated to the proxies 🤔 so maybe that's not the best approach. Unless there is some way on the relay / agent / etc layer to not return true from Did that make sense? Basically I think we should manage this in the |
src/endpoint/locality.rs
Outdated
// This will mean that updated metadata such as new tokens | ||
// will be respected, but we will still retain older | ||
// endpoints that are currently actively used in a session. | ||
Entry::Occupied(entry) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this dangerous? Basically if an endpoint is getting a malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect the internal resources -- doesn't this mean that that won't work while the malicious traffic is continuing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically if an endpoint is getting a malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect the internal resources -- doesn't this mean that that won't work while the malicious traffic is continuing?
Well yeah, that is why we shouldn't accept this version. I'll write up my response to your other questions/proposed solution in a separate comment, but I think what we need is to be able to have different verbs here. We need to be able to distinct between "services are disrupted or a bug has caused a endpoint to be lost in the chain" where we want the load balancer to retain that endpoint because we don't want player's to be dropped mid-game, and "stop sending traffic to this endpoint because we want to stop traffic to it".
a0d6458
to
8c2fc7d
Compare
So I don't like that approach for couple reasons.
I think what we need is new verbs for communicating changes to load balancers, so that's actually clear when we intentionally want to delete or change endpoints, and when there's potentially a service disruption, misconfiguration, or a bug, and have the load balancers handle those differently, as I think absent an intentionality (from removing a token, using a firewall, etc). The load balancer should strive to allow player's to continue playing to the end of their session. This could be having "Delta xDS" replace our current xDS protocol, so that we're communicating changes, and as added benefit we're reducing the amount of traffic we're communicating as we scale. |
Ultimately that's probably the better path, I think you are right. If a proxy hasn't receive the request to remove something, then it will just keep it. If the relay server crashes, and a agent behind that also happened to crash at the same time - the relay can't tell any of the proxies to delete something that it wasn't aware of in the first place. On the management server / agent perspective - I would say it still make sense to have a
At the agent / management layer, I don't think this is a concern, because fast connections to systems that have bad data don't help anyone. We could block on the gRPC request until the initial check completes, but at that point we're introducing the same delay anyway? Or maybe a check at both layers is good, just to be extra sure. I'd lead more towards optimising more on correct information than speed of delivery for this type of scenario. |
Well actually I do like that better, because that delay is only on initialisation right? It wouldn't be whenever there's no endpoints. Even though I think you're right from the agent / management server perspective that it mostly doesn't matter, and what matters is that it's returning data. From the load balancer side, it's good to see stuff like "warning cannot connect to " whenever it has zero endpoints, but having the check be until it receives the first update from k8s does prevent that from happening outside of startup. |
a1d7b28
to
3dce130
Compare
b2b0ff3
to
b748386
Compare
7730eeb
to
5692a55
Compare
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size. |
Build Failed 😭 Build Id: f72cf2aa-6398-4db3-b8c2-e48fa8f082fa Status: FAILURE To get permission to view the Cloud Build view, join the quilkin-discuss Google Group. Filter with the Git Commit Development images are retained for at least 30 days. |
I don't think should be the actual solution, but I needed to test it, and figure we can also use the PR to discuss how to solve the problem in a better way.
Consider the following infrastructure
proxy -> management server
proxy -> relay <- agent
If the connection to the relay or management server, or the agent is disrupted for some reason, and those services need to be restarted, there's a chance that the proxy or relay will receive a response before the agent or management server has received information from k8s, which would lead to the proxy clearing its all of its endpoints, which will lead to player's traffic being dropped for that period of time.
This PR changes that behaviour by adding a shared atomic counter to endpoints, and when it merges new configuration will retain the old endpoints that are currently in use. It will still replace their metadata so if we receive the same endpoint, but we'll keep any in-use endpoints for their session duration.