This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

test: Add test for global rate limiting with load balancing #207

Closed
wants to merge 1 commit into from

philipgough
Contributor

@philipgough commented Dec 12, 2023

Draft PR to replicate the behaviour/bug seen in #208

@philipgough marked this pull request as ready for review December 12, 2023 14:36
@philipgough marked this pull request as draft December 12, 2023 14:36
@@ -859,6 +864,62 @@ func TestGlobalRateLimits(t *testing.T) {
})
}

func TestGlobalRateLimitsWithLoadBalancing(t *testing.T) {
Contributor

As a non-maintainer of this repo, it looks like a code smell to me that the easiest way to assert this behavior is through a functional test and not a unit test. 🤔

@thrawn01
Contributor

What are we testing here? gRPC's load balancing, or the ability of gubernator to distribute the hits to the rest of the cluster?

In reference to unit testing vs. functional testing: if we can have a unit test which tests the, ahem, functionality without violating public interfaces, then we can do so. Any test which violates the public interface is suspect, as it limits your ability to refactor in the future. Functional Tests > Unit Tests.

You can read my thoughts on this here https://wippler.dev/posts/Testing-private-methods

@thrawn01
Contributor

ah, I didn't realize this PR was related to #208 👍

@philipgough
Contributor Author

Yep @thrawn01 - I just created this to reproduce the exact issue I reported in #208 for clarity!

client := guber.NewV1Client(conn)

sendHit := func(status guber.Status, assertion func(resp *guber.RateLimitResp), i int) string {
ctx, cancel := context.WithTimeout(context.Background(), clock.Hour*5)
Contributor

This timeout is waaaay too high. By default, go test aborts a run after 10 minutes, so CI will never let a request wait anywhere near that long.

Contributor Author

Just to be clear, this test wasn't ready to be merged as is, which is why it was in draft :) It was simply meant to reproduce the culprit behaviour reported in the bug.

return gotResp.GetMetadata()["owner"]
}

// Send two hits that should be processed by the owner and the peer and deplete the limit
Contributor

"should be processed by the owner"

Are you asserting this anywhere? I don't see it.

Contributor Author

No, in this case I'm relying on the gRPC load balancer we passed in.

Contributor Author

But we can and should check this in order to harden the test.

Comment on lines +878 to +879
address := fmt.Sprintf("static:///%s,%s", owner, peer)
conn, err := grpc.DialContext(context.Background(), address, dialOpts...)
Contributor

@miparnisari Jan 23, 2024

Can you clarify what this is doing? It's not clear to me. Are you creating a gRPC connection to the owner or to the peer?

Contributor Author

This creates a static pool of servers to communicate with and then opens the connection.

Contributor

Which server is the connection to?

Contributor Author

To both: requests are load-balanced between the owner and the peer.

Comment on lines +915 to +916
// sleep to ensure the async forward has occurred and state should be shared
time.Sleep(time.Second * 5)
Contributor

I think (not sure) you can control this via setting globalBatchTimeout = 1 and globalSyncWait = 1ms, so you don't need to wait so long here

Contributor

Or, use github.com/mailgun/holster/v4/clock to freeze or advance system time during the test.

@miparnisari
Contributor

miparnisari commented Jan 23, 2024

@thrawn01

What are we testing here? gRPC's load balancing, or the ability of gubernator to distribute the hits to the rest of the cluster?

I didn't write the test, but the goal should be to assert that the owner gossips rate-limit state to its peers correctly, specifically that the status field is correct.

Functional Tests > Unit Tests.

I don't know what you meant by this (and I can't access the URL you provided on my work laptop), but if a bug happens in a confined place (which is the case here), you should not have to write a functional test to demonstrate it. The test should be laser-focused on what is broken. Functional tests cover a lot more code.

That being said, I don't know the codebase well enough to say whether a unit test makes sense here.

@thrawn01
Contributor

@miparnisari re functional testing.

I understand that most developers are taught to write unit tests and thus are surprised when they encounter a project which is mostly functional tests. This is a common issue when we are onboarding new developers at Mailgun. Please have confidence when making changes to algorithms.go that those functions are well tested. A coverage run shows 81% of statements in algorithms.go are tested, with 1,195 calls during the functional test suite run. I'm not claiming the current testing level is adequate (we could improve coverage on Behavior_DURATION_IS_GREGORIAN, for instance), only that functional testing allows developers to freely change the internals of a system while having high confidence they didn't break functionality.

@Baliedge
Contributor

@philipgough Thank you very much for your contribution. I was able to reproduce the error using your test. Since the PR has a merge conflict and it's been some time since it was created, I updated from master in a separate branch and reran the test to see whether recent changes from #219 had addressed your issue.

I found some improvement. Sometimes the test passes:

=== RUN   TestGlobalRateLimitsWithLoadBalancing
--- PASS: TestGlobalRateLimitsWithLoadBalancing (5.00s)
PASS
ok      github.com/mailgun/gubernator/v2        6.074s

Other times I get 2 failed assertions:

=== RUN   TestGlobalRateLimitsWithLoadBalancing
    functional_test.go:1060:
                Error Trace:    /Users/spoulson/src/gubernator2/functional_test.go:1060
                                                        /Users/spoulson/src/gubernator2/functional_test.go:1070
                Error:          Not equal:
                                expected: 0
                                actual  : 1
                Test:           TestGlobalRateLimitsWithLoadBalancing
                Messages:       1
    functional_test.go:1060:
                Error Trace:    /Users/spoulson/src/gubernator2/functional_test.go:1060
                                                        /Users/spoulson/src/gubernator2/functional_test.go:1071
                Error:          Not equal:
                                expected: 0
                                actual  : 1
                Test:           TestGlobalRateLimitsWithLoadBalancing
                Messages:       2
--- FAIL: TestGlobalRateLimitsWithLoadBalancing (5.01s)

But this is better than the original code that failed 5 assertions. We're on the right path.

You can refer to my branch to cherry-pick my commits if needed: master...Baliedge/global-lb

@Baliedge
Contributor

Closed in favor of #224.
I applied the edits I mentioned above and some other adjustments. Test passes every time.

@Baliedge closed this Feb 22, 2024
4 participants