Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(plugin):router_limits plugin added #6598

Merged
merged 115 commits into from
Jan 29, 2025
Merged

Conversation

aaronArinder
Copy link
Contributor

@aaronArinder aaronArinder commented Jan 21, 2025

router limits plugin for limiting the router based on what's in the user's license, starting with tps limits

NB: this includes the work from #6561


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

garypen and others added 30 commits December 18, 2024 17:03
There's a long way to go, but some interesting aspects of the required
changes are implemented here.

I noted that a significant number of errors and glitches and tests are
resolved by making these changes. Mainly places where we were calling a
service without readying it first or places where we were cloning an
inner service without doing the memory "replace" dance.

Because we are using a "trick" to make the router service cloneable
right now, there are a few tests which don't work properly. I think for
the "full" work, we'll need to make the router service properly
cloneable (without requiring a mutex).

This will require some fairly substantial re-working of a wide variety
of services and layers. On the plus side, once we've done that work
we'll be able to retire a bunch of code that we've written that we will
no longer require.

I'll pick this up in the New Year...
That test is just old on dev/next, so fixing it makes sense

Also: add a note to coprocessor test changes to remind me that I need to
understand what is happening there before this branch can merge.
re-order use statements to keep cargo fmt happy
To see how far off passing we are in CI
IMPORTANT:

This change modifies the supergraph method invocation test to be a
router_service service invocation test.

Amongst other important details (such as we are now really testing the
service through the full pipeline), it's important to note that we can't
use `oneshot()` and just re-create the service every time we want to
call it. If we do, then the rate limiting details are lost. So, we must
re-use our service to make sure that state isn't lost.
Also:
 - update snapshot so it knows about concurrency
 - replace a bad supergraph test with a broken router test
 - re-order traffic shaping so that timeouts > concurrency > rate-limit
The code added yesterday was creating errors as data and using numeric
codes (rather than the magic strings in 1.x). Re-instated the 1.x
behaviour for reporting errors and also fixed the new timeout test.
Since we now fail traffic shaping tests at the router_service, they
are not counted as graphql_errors (which are only processed, as they
should be at the supergraph_service).

IMO, these should never have been counted as graphql errors anyway,
since they clearly aren't graphql errors but are traffic shaping (rate
limit, timeout, etc...) errors. We'll still report them to the user as a
504 or a 503 or whatever, but they won't count towards the graphql_error
metric.

I've also updated a snapshot to reflect the error message we now
provide.
Not required for GA. Implement later as a separate project.
This breaks all the http_post_mutation tests because of changes in
expectation.
The AsyncCheckpoint layer was using `oneshot` and wasn't calling the
prepared service. I've fixed that.

This affects some of the tests, so I've fixed them as well.
Make it pass until subgraph rate limiting is changed. We'll need to
update the test agains at that point.
I've been adding `buffer(50_000)` across the code base. Now replacing
with `buffered` (one of our existing layers from our ServiceBuilderExt
along with a helpful comment that it still needs some work.
It's just about possible to extend backpressure from the router service
to the qp service with the caveat that batches may subvert backpressure.

Beyond the qp, it's difficult to control because of the way the
execution engine operates.

This comment has been minimized.

@@ -795,10 +823,31 @@ mod test {

register_plugin!("test", "always_fails_to_start", AlwaysFailsToStartPlugin);

async fn create_service(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving this above the tests for (at least to me) cleaner organization; down to move it back if that annoys folks

@@ -795,10 +821,31 @@ mod test {

register_plugin!("test", "always_fails_to_start", AlwaysFailsToStartPlugin);

async fn create_service(
config: Configuration,
license: Option<LicenseState>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add license to this as none of the tests use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right! fixed

pub(crate) struct TpsLimit {
pub(crate) capacity: usize,

#[serde(deserialize_with = "deserialize_instant", rename = "durationMs")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use #[serde(deserialize_with = "humantime_serde::deserialize", default)]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it's a duration not an instant :)

@BrynCooke
Copy link
Contributor

Approved, make sure you have adequately manually tested!

@aaronArinder aaronArinder enabled auto-merge (squash) January 29, 2025 13:20
@aaronArinder aaronArinder merged commit 419f2bd into dev Jan 29, 2025
15 checks passed
@aaronArinder aaronArinder deleted the aaron/BGS-188/router_limits branch January 29, 2025 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants