
Gateway rate limiting #1942

Merged
merged 18 commits into main from ziga/gateway_rate_limiting on Jul 24, 2024

Conversation

zkokelj
Contributor

@zkokelj zkokelj commented Jun 3, 2024

Why this change is needed

Due to a huge volume of calls from some IPs, we want to guard some of our expensive endpoints with a rate limiter.

What changes were made as part of this PR

Please provide a high level list of the changes made

PR checks pre-merging

Please indicate below by ticking the checkbox that you have read and performed the required
PR checks

  • PR checks reviewed and performed

Collaborator

@tudor-malene tudor-malene left a comment

I suggested a slight change in the approach, as it looks like we need to handle a few more cases.

Each request that has the calculateRateLimitScore lambda defined in the ExecConfig will append that value to the current score for the user.

We need to make the RateLimiter type a service with Start and Stop methods.
In the Start method, there should be an infinite loop with a timer that fires every second (or some configurable interval) and reduces the score of all users by some amount.

This way, requests add to the score as they are made, and in the background we reduce it.

tools/walletextension/ratelimiter/rate_limiter.go
tools/walletextension/rpcapi/blockchain_api.go
tools/walletextension/rpcapi/utils.go
tools/walletextension/ratelimiter/rate_limiter.go
Collaborator

@tudor-malene tudor-malene left a comment

lgtm
(after avoiding the conversion of the userid to string)

@zkokelj zkokelj force-pushed the ziga/gateway_rate_limiting branch from 1982a94 to 7ed9dfe Compare June 7, 2024 15:46
@zkokelj zkokelj marked this pull request as ready for review June 10, 2024 07:50
@ten-protocol ten-protocol deleted a comment from vercel bot Jun 11, 2024
@zkokelj zkokelj force-pushed the ziga/gateway_rate_limiting branch from 935ee5a to 1baf55c Compare June 12, 2024 11:12
Collaborator

@tudor-malene tudor-malene left a comment

The logic is clean now. It needs an additional mechanism.

// The execution cost is 100 times the execution duration in milliseconds;
// we can change how much a user can run by changing the decay rate
executionCost := uint32(executionDuration) * 100
w.RateLimiter.UpdateScore(gethcommon.Address(userID), executionCost)
Collaborator

I'd suggest passing the duration into the rate-limiter, so we have all rate-limiting logic in a single place.

Contributor Author

fixed

tools/walletextension/ratelimiter/rate_limiter.go

// UpdateScore updates the score of the user based on the additional score (time taken to process the request)
func (rl *RateLimiter) UpdateScore(userID common.Address, additionalScore uint32) {
rl.mu.Lock()
Collaborator

Worth adding the threshold == 0 check here as well. Or maybe create an internal method: "isEnabled".

Contributor Author

added

tools/walletextension/ratelimiter/rate_limiter.go
rateLimitThresholdUsage = "Rate limit threshold per user. Default: 1000."

rateLimitDecayName = "rateLimitDecay"
rateLimitDecayDefault = 0.2
Collaborator

If we make this "1", then we keep parity with the milliseconds, and we can reason in units of time.

Contributor Author

@zkokelj zkokelj Jul 3, 2024

If we make this 1, it means we are decaying by 1ms every 1ms.

Collaborator

yeah. It's easier to reason about it.

Contributor Author

Yeah, but it means that by default every user is allowed to use 1000ms of compute per 1000ms on average (100%). Rate limits will be hit very rarely, and only with concurrent requests.

rateLimitDecayDefault = 0.2
rateLimitDecayUsage = "Rate limit decay per user. Default: 0.2"

rateLimitMaxScoreName = "rateLimitMaxScore"
Collaborator

this parameter is confusing.

Contributor Author

removed this parameter

@@ -59,6 +59,18 @@ const (
storeIncomingTxs = "storeIncomingTxs"
storeIncomingTxsDefault = true
storeIncomingTxsUsage = "Flag to enable storing incoming transactions in the database for debugging purposes. Default: true"

rateLimitThresholdName = "rateLimitThreshold"
rateLimitThresholdDefault = 1000
Collaborator

If this is 200 and the decay is 1, we allocate 200ms per second to each user.

Contributor Author

If the decay is 1, then every ms one ms is subtracted from the score.
Setting the decay to 0.2 means that every second the score decays by 200ms.
Setting the threshold to 1000 means that a user can use 1000ms in a second, but it will decay at the decay rate above. So we are not rate limiting a user who makes a request that takes 250ms and immediately wants to make another request. But if they keep doing it, rate limiting will kick in.
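A quick worked example of these numbers (illustrative values only, using the decay 0.2 and threshold 1000 from the discussion):

```go
package main

import "fmt"

func main() {
	// Illustrative config values from the discussion above.
	const decayPerMs = 0.2   // score units removed per elapsed millisecond
	const threshold = 1000.0 // max accumulated score, in ms of compute

	score := 250.0               // one request that took 250ms of compute
	score -= decayPerMs * 1000.0 // one second later: decayed by 200ms
	fmt.Println(score)           // 50

	// An immediate second 250ms request would bring the score to 300,
	// still under the threshold, so the user is not rate limited.
	fmt.Println(score+250.0 < threshold) // true
}
```

Only a user who sustains more than the decayed budget over time accumulates enough score to cross the threshold.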

// decays based on the time since the last request.

type Score struct {
lastRequest time.Time
Collaborator

Let's add a "concurrentRequests" value here as well, which we increment during "Allow" and decrement during "Update".

In "Allow" we should also check that concurrentRequests is below a configured threshold (a small value like 3).

With this mechanism we protect against someone DoS-ing the gateway by triggering a very large number of expensive requests.
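The suggested guard could be sketched like this; the type and method names are assumptions, not the PR's exact code, and the underflow guard in Update reflects a check discussed later in this thread:

```go
package main

import (
	"fmt"
	"sync"
)

type userState struct {
	score              uint32
	concurrentRequests uint32
}

type Limiter struct {
	mu            sync.Mutex
	users         map[string]*userState
	scoreLimit    uint32
	maxConcurrent uint32
}

func NewLimiter(scoreLimit, maxConcurrent uint32) *Limiter {
	return &Limiter{
		users:         make(map[string]*userState),
		scoreLimit:    scoreLimit,
		maxConcurrent: maxConcurrent,
	}
}

// Allow admits the request only if both the score and the number of
// in-flight requests are under their thresholds, and claims a slot.
func (l *Limiter) Allow(user string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	u, ok := l.users[user]
	if !ok {
		u = &userState{}
		l.users[user] = u
	}
	if u.score >= l.scoreLimit || u.concurrentRequests >= l.maxConcurrent {
		return false
	}
	u.concurrentRequests++
	return true
}

// Update records the request cost and releases the concurrency slot.
func (l *Limiter) Update(user string, cost uint32) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if u, ok := l.users[user]; ok {
		u.score += cost
		if u.concurrentRequests > 0 { // guard against unsigned underflow
			u.concurrentRequests--
		}
	}
}

func main() {
	l := NewLimiter(1000, 3)
	for i := 0; i < 4; i++ {
		fmt.Println(l.Allow("alice")) // the 4th in-flight request is rejected
	}
}
```

Even a user whose score is low cannot fire more than maxConcurrent expensive requests at once.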

Contributor Author

added

Collaborator

@tudor-malene tudor-malene left a comment

I think we can simplify a bit by keeping the responsibilities more compact.

Also, the rate limit allow/update needs to be added to the filter_api.GetLogs method, which doesn't use ExecAuthRPC.


// Increase the score of the user based on the execution duration of the request
// and update the last request time
newScore := rl.users[userID].score + executionDuration - scoreDecay
rl.users[userID] = Score{lastRequest: now, score: newScore, concurrentRequests: userScore.concurrentRequests - 1}
Collaborator

The -1 needs to be checked: if concurrentRequests is already 0, it will underflow and wrap around.

Actually, it can't ever be 0 at this point.
Anyway, worth checking to avoid something weird.

Contributor Author

should never happen, but added additional check for it

userScore, exists := rl.users[userID]
scoreDecay := uint32(0)
// if user exists decay the score based on the time since the last request
if exists {
Collaborator

If there is no score for that user, it's a programming error. This method should always be called after the previous one.

Contributor Author

logging that error now

// if user exists decay the score based on the time since the last request
if exists {
// Decay the score based on the time since the last request and the decay rate
timeSinceLastRequest := float64(now.Sub(userScore.lastRequest).Milliseconds())
Collaborator

I don't quite understand why we need to decay the score here. It's hard to reason about what's going on, as that responsibility is now in 2 places.

Contributor Author

We decay it only here in UpdateScore.
But we also need to calculate it in the Allow function to compare it to the threshold. Is this what you meant in this comment: #1942 (comment)?

// Increase the score of the user based on the execution duration of the request
// and update the last request time
newScore := rl.users[userID].score + executionDuration - scoreDecay
rl.users[userID] = Score{lastRequest: now, score: newScore, concurrentRequests: userScore.concurrentRequests - 1}
Collaborator

let's not change the lastRequest here.

Contributor Author

fixed

@zkokelj zkokelj requested a review from tudor-malene July 3, 2024 16:05
Collaborator

@BedrockSquirrel BedrockSquirrel left a comment

LGTM to what Tudor requested. My comments are minor things, please ignore any you disagree with!

${{ vars.DOCKER_BUILD_TAG_GATEWAY }} \
-host=0.0.0.0 -port=8080 -portWS=81 -nodeHost=${{ vars.L2_RPC_URL_VALIDATOR }} -verbose=true \
-logPath=sys_out -dbType=mariaDB -dbConnectionURL="obscurouser:${{ secrets.OBSCURO_GATEWAY_MARIADB_USER_PWD }}@tcp(obscurogateway-mariadb-${{ github.event.inputs.testnet_type }}.uksouth.cloudapp.azure.com:3306)/ogdb" \
-rateLimitThreshold=${{ vars.GATEWAY_RATE_LIMIT_THRESHOLD }} -rateLimitDecay=${{ vars.GATEWAY_RATE_LIMIT_DECAY }} -maxConcurrentRequestsPerUser=${{ vars.GATEWAY_MAX_CONCURRENT_REQUESTS_PER_USER }} '
Collaborator

Looks like this config usage is out of date now since the latest changes (also you probably need to duplicate the config flags in the manual-upgrade yaml script).

Contributor Author

Fixed, thanks 👍

@@ -59,6 +59,18 @@ const (
storeIncomingTxs = "storeIncomingTxs"
storeIncomingTxsDefault = true
storeIncomingTxsUsage = "Flag to enable storing incoming transactions in the database for debugging purposes. Default: true"

rateLimitUserComputeTimeName = "rateLimitUserComputeTime"
Collaborator

For time duration config we usually either use a string and then call time.ParseDuration on it (like BatchInterval), or we include the units in the flag name (like l1RPCTimeoutSecs). Or if you think it's clear enough then maybe just mention millis in the usage description here.

Contributor Author

Using duration now, thanks.

}

// UpdateRequest updates the end time of a request interval given its UUID.
func (rl *RateLimiter) UpdateRequest(userID common.Address, id uuid.UUID) {
Collaborator

I'd maybe call this method like EndRequest() or SetRequestEnd() or something for clarity.

Contributor Author

Renamed 👍

request.End = &now
user.CurrentRequests[id] = request
} else {
log.Printf("Request with ID %s not found for user %s.", id, userID.Hex())
Collaborator

Do we use the builtin log library in the wallet extension server? That seems quite convenient to work with.

Contributor Author

Logger used

if user, exists := rl.users[userID]; exists {
cutoff := time.Now().Add(-time.Duration(rl.window) * time.Millisecond)
for _, interval := range user.CurrentRequests {
if interval.End != nil && interval.End.After(cutoff) {
Collaborator

I guess we could include in-flight requests in this sum, like totalComputeTime += now.Sub(interval.Start) if End is nil, maybe?

Contributor Author

fixed


// Check if user is in limits of rate limiting
userComputeTimeForUser := rl.SumComputeTime(userID)
if userComputeTimeForUser <= rl.userComputeTime {
Collaborator

This is maybe a bit subjective, but I think in Go they usually put the happy path at the end of the method. So check if userComputeTimeForUser > threshold here for the bad case, and have return true, requestUUID as the last line of the method. Feel free to ignore that though, nitpicking haha.
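The shape being suggested is roughly this (purely illustrative names and types, simplified from the real method):

```go
package main

import "fmt"

// allow illustrates the Go guard-clause style: the limit-exceeded case
// returns early, and the happy path is the last line of the method.
func allow(totalComputeMs, thresholdMs int) (bool, string) {
	if totalComputeMs > thresholdMs {
		return false, "" // bad case handled first
	}
	return true, "hypothetical-request-id" // happy path last
}

func main() {
	ok, id := allow(500, 1000)
	fmt.Println(ok, id)
}
```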

Contributor Author

fixed & noted. Will stick with the happy path at the end of methods when possible.

defer rl.mu.Unlock()

// delete all the requests that ended before the rate limiter's window
cutoff := time.Now().Add(-time.Duration(rl.window) * time.Millisecond)
Collaborator

It might be nice if rl.window was already a Duration (like I mentioned in the config stuff) for lines like this.

Contributor Author

removed unneeded conversions to Duration

}

// PruneRequests deletes all requests that have ended before the rate limiter's window.
func (rl *RateLimiter) PruneRequests() {
Collaborator

It'd be nice to know if this starts being slow; maybe a log line at the end if it's over some amount, like:

startTime := time.Now()
// ... pruning work ...
timeTaken := time.Since(startTime)
if timeTaken > 1*time.Second {
   log.Printf("PruneRequests completed in %s", timeTaken)
}

Bit worried the contention on the mutex lock could become troublesome. If it does, I guess we can add a per-user rw-lock in the User object so pruning doesn't block new requests from existing users being allowed.

Contributor Author

added logs if time taken > 1s.

I am aware of the mutex lock problems (created getters and increments, and don't hold the lock for too long).
Do you think we should try it like this and later (if it becomes a problem) switch to per-user locking?

Collaborator

Yeah, I think it's fine for now; all your locks looked good to me and it might not be an issue at all. But something to consider if we start seeing unexplained slowness at some point.

Collaborator

@BedrockSquirrel BedrockSquirrel left a comment

lgtm

@@ -149,6 +154,8 @@ func ExecAuthRPC[R any](ctx context.Context, w *Services, cfg *ExecCfg, method s
return nil, rpcErr
})

w.RateLimiter.SetRequestEnd(gethcommon.Address(userID), requestUUID)
Collaborator

One comment: I think this would be better in a defer.
(Same for filter_api.)
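The point of the defer is that the request end is recorded even when the RPC call errors out or panics. A minimal sketch, with a stand-in limiter type and hypothetical names:

```go
package main

import "fmt"

// fakeLimiter stands in for the real rate limiter to show the pattern.
type fakeLimiter struct{ ended []string }

func (f *fakeLimiter) SetRequestEnd(user, requestID string) {
	f.ended = append(f.ended, requestID)
}

// execAuthRPC sketches the wrapper: SetRequestEnd runs via defer on
// every exit path, including early returns and errors.
func execAuthRPC(rl *fakeLimiter, user, requestID string) error {
	defer rl.SetRequestEnd(user, requestID) // always record the end time
	// ... perform the actual RPC call here ...
	return fmt.Errorf("upstream error") // even on error, the defer fires
}

func main() {
	rl := &fakeLimiter{}
	_ = execAuthRPC(rl, "alice", "req-1")
	fmt.Println(len(rl.ended)) // 1
}
```

Without the defer, any error path that returns before the explicit SetRequestEnd call would leave the request interval open forever, permanently inflating the user's compute-time sum.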

Contributor Author

fixed.

Collaborator

@tudor-malene tudor-malene left a comment

lgtm

@zkokelj zkokelj merged commit ad61bdb into main Jul 24, 2024
5 checks passed
@zkokelj zkokelj deleted the ziga/gateway_rate_limiting branch July 24, 2024 12:16