Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add middleware for request prioritization #33951

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

bohde
Copy link
Contributor

@bohde bohde commented Mar 20, 2025

This adds a middleware for overload protection, that is intended to help protect against malicious scrapers. It does this via codel, which will perform the following:

  1. Limit the number of in-flight requests to some user defined max
  2. When in-flight requests have reached their max, begin queuing requests, with logged in requests having priority above logged out requests
  3. Once a request has been queued for too long, it has a percentage chance to be rejected based upon how overloaded the entire system is.

When a server experiences more traffic than it can handle, this has the effect of keeping latency low for logged in users, while rejecting just enough requests from logged out users to keep the service from being overloaded.

Below are benchmarks showing a system at 100% capacity and 200% capacity in a few different configurations. The 200% capacity is shown to highlight an extreme case. I used hey to simulate the bot traffic:

hey -z 1m -c 10 "http://localhost:3000/rowan/demo/issues?state=open&type=all&labels=&milestone=0&project=0&assignee=0&poster=0&q=fix"

The concurrency of 10 was chosen from experiments where my local server began to experience higher latency.

Results

Method Concurrency p95 latency Successful RPS Requests Dropped
QoS Off 10 0.2960s 44 rps 0%
QoS Off 20 0.5667s 44 rps 0%
QoS On 20 0.4409s 48 rps 10%
QoS On 50% Logged In* 10 0.3891s 33 rps 7%
QoS On 50% Logged Out* 10 2.0388s 13 rps 6%

Logged in users were given the additional parameter -H "Cookie: i_like_gitea=<session>.

Tests with * were run at the same time, representing a workload with mixed logged in & logged out users. Results are separated to show prioritization, and how logged in users experience a 100ms latency increase under load, compared to the 1.6 seconds logged out users see.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Mar 20, 2025
@github-actions github-actions bot added modifies/go Pull requests that update Go code modifies/dependencies docs-update-needed The document needs to be updated synchronously labels Mar 20, 2025
@bohde bohde force-pushed the rb/request-qos branch 3 times, most recently from f3096c1 to 6c499a9 Compare March 20, 2025 18:14
@wxiaoguang
Copy link
Contributor

Results ....

TBH, according to your test result, I do not see it is really useful .......

Copy link
Member

@a1012112796 a1012112796 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at first step, just distinguish between login and not is enough. and looks it can be added for the api router also.

@bohde
Copy link
Contributor Author

bohde commented Mar 21, 2025

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for awhile, and while it takes a bit of tuning to get the right settings for a given server it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

I think at first step, just distinguish between login and not is enough. and looks it can be added for the api router also.

This was in my initial version we ran, but then users who are attempting to login can still experience lower priority traffic. Since the major source of traffic for public code hosting are scrapers trying to download content, deprioritizing those routes mitigates that concern.

@wxiaoguang
Copy link
Contributor

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for awhile, and while it takes a bit of tuning to get the right settings for a given server it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

So the original intention is to fight with crawlers/spiders/robots?

@bohde
Copy link
Contributor Author

bohde commented Mar 21, 2025

TBH, according to your test result, I do not see it is really useful .......

We've been running this patch for awhile, and while it takes a bit of tuning to get the right settings for a given server it has alleviated a lot of outage concerns from waves of scrapers that don't respect robots.txt.

So the original intention is to fight with crawlers/spiders/robots?

Yes, I mentioned this in my description

This adds a middleware for overload protection, that is intended to help protect against malicious scrapers

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Mar 21, 2025

Yes, I mentioned this in my description

Sorry, missed that part. The "perform the following" part and "result" part attracted my attention .....

If the intention is to "protect against malicious scrapers", I think the parts := []string{ could be improved, a similar function like ParseGiteaSiteURL could help the route path handling. And I think we need some tests to cover the new middleware's behavior.


Or like https://github.com/go-gitea/gitea/pull/33951/files#r2006668195 said, mark the routes with priority

@wxiaoguang
Copy link
Contributor

I think I have some ideas about how to handle these "route paths" clearly. Could I push some commits into your branch?

@bohde
Copy link
Contributor Author

bohde commented Mar 21, 2025

I think I have some ideas about how to handle these "route paths" clearly. Could I push some commits into your branch?

Could you send a PR to my branch, or sketch it out in another commit? This would prevent merge conflicts on my branch

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Mar 21, 2025

sketch it out in another commit

Sorry, don't quite understand what it exactly means ....

Could you send a PR to my branch

We can use chi router's "RoutePattern", and make code testable.

@wxiaoguang wxiaoguang added the type/feature Completely new functionality. Can only be merged if feature freeze is not active. label Mar 21, 2025
@wxiaoguang wxiaoguang added this to the 1.24.0 milestone Mar 21, 2025
@bohde
Copy link
Contributor Author

bohde commented Apr 7, 2025

update: Close since there is no interest after 2 weeks

Just getting back to this now as I was dealing with other priorities. I could incorporate the above changes, but the near complete rewrite of my PR makes it difficult to isolate the changes you are asking, and makes it incompatible with the production implementation we are running. This makes it difficult to test these changes against the live traffic we are seeing to ensure they still perform well.

This adds a middleware for overload protection, that is intended to help protect against malicious scrapers. It does this by via  [`codel`](https://github.com/bohde/codel), which will perform the following:

1. Limit the number of in-flight requests to some user defined max
2. When in-flight requests have reached their max, begin queuing requests, with logged in requests having priority above logged out requests
3. Once a request has been queued for too long, it has a percentage chance to be rejected based upon how overloaded the entire system is.

When a server experiences more traffic than it  can handle, this has the effect of keeping latency low for logged in users, while rejecting just enough requests from logged out users to keep the service from being overloaded.

Below are benchmarks showing a system at 100% capacity and 200% capacity in a few different configurations. The 200% capacity is shown to highlight an extreme case. I used [hey](https://github.com/rakyll/hey) to simulate the bot traffic:

```
hey -z 1m -c 10 "http://localhost:3000/rowan/demo/issues?state=open&type=all&labels=&milestone=0&project=0&assignee=0&poster=0&q=fix"
```

The concurrency of 10 was chosen from experiments where my local server began to experience higher latency.

Results

| Method | Concurrency |  p95 latency | Successful RPS | Requests Dropped |
|--------|--------|--------|--------|--------|
| QoS Off | 10 | 0.2960s | 44 rps | 0% |
| QoS Off | 20 | 0.5667s | 44 rps | 0%|
| QoS On | 20 |  0.4409s | 48 rps | 10% |
| QoS On 50% Logged In* | 10 | 0.3891s | 33 rps | 7% |
| QoS On 50% Logged Out* | 10  | 2.0388s | 13 rps | 6% |

Logged in users were given the additional parameter ` -H "Cookie: i_like_gitea=<session>`.

Tests with `*` were run at the same time, representing a workload with mixed logged in & logged out users. Results are separated to show prioritization, and how logged in users experience a 100ms latency increase under load, compared to the 1.6 seconds logged out users see.
@bohde bohde force-pushed the rb/request-qos branch 2 times, most recently from 351be38 to 7cbcef3 Compare April 7, 2025 22:57
@bohde
Copy link
Contributor Author

bohde commented Apr 7, 2025

I applied review feedback, and synthesized into something that is easily backportable to 1.23 by avoiding APIs added in main. @wxiaoguang's PR did miss some important endpoints we sometimes see heavy traffic on since it didn't check every possible delimited portion of the subpath against the string in my original implementation. However, because the RoutePattern can be pulled directly in this middleware using @wxiaoguang's method, this can all be simplified to the following priority scheme:

Logged In Repo Context Priority
Yes - High
No Yes Low
No No Default

This allows key flows such as login and explore to still be prioritized well, with only the repo specific endpoints likely to be throttled. In my experience, this catches the vast majority of what scrapers are targeting.

@bohde bohde requested review from wxiaoguang and a1012112796 April 7, 2025 23:18
a1012112796 added a commit to a1012112796/gitea that referenced this pull request Apr 8, 2025
a early draft to try split registerWebRoutes, it's too long
now. maybe this change will be usefull for
go-gitea#1872 ,
go-gitea#33951 ...

Signed-off-by: a1012112796 <[email protected]>
@github-actions github-actions bot added modifies/translation modifies/templates This PR modifies the template files labels Apr 10, 2025
@bohde
Copy link
Contributor Author

bohde commented Apr 10, 2025

Regarding the 503 page: I believe it’s better to show an internal custom 503 page which is better than showing a blank 503 page to anonymous visitors when the site is overloaded. A blank 503 often gives the impression that it comes from a reverse proxy and the site is completely down, which might discourage users from returning.

Thank you for the the feedback! I have implemented a custom 503 page when the client requests HTML in 5078f67. In an effort to keep the response light in some cases, it does fallback to a plain text response if the user doesn't request HTML.

As for authorship, when a pull request includes commits from multiple contributors, a co-author trailer is automatically added upon merging. So we don’t need to worry about proper attribution in that case.

I'm happy to take this to another issue, but my frustration is around new PRs that @wxiaoguang has opened, with rewritten commits from this PR, while not maintaining the commit history and thus my authorship. For example, I feel that #34167 is largely based on this code, and has copy+pasted some parts of it, even if the goals are not the same. Additionally, #34024 seems to also have taken portions of this PR, rewritten it in a new commit, and removed my authorship. For example, the middleware in 3ae9466#diff-e12ddfac3974ba21f1c9c7389e69beccb1503a67f7e12803d7eb52b6fb133bfcR18 has a lot of similarities to the middleware in my original commit f3096c1#diff-ccb5640b0d6352850437df3f2f9b6416a667f4233d5c7b98d5ad2ad871a3fea1R58, including classifying similar endpoints as either "low priority" or "expensive" and checking for long polling in a similar manner. My commit predates @wxiaoguang's by 6 days, and they were aware of it since they reviewed it the same day I submitted my PR. Since #34024 not reference my commit or this PR, I wasn't aware of its existence until just recently.

@wxiaoguang
Copy link
Contributor

I have made a PR for you: bohde#1 , there was just no response in 2 weeks, but users were waiting. TBH I have no interest to waste time on inactive or stale PRs. At least, in history I had to help to improve some of your PRs or propose some following up fixes for your PRs. A PR with "Maintainers are allowed to edit this pull request" should be open to maintainers to edit. I don't understand why you complain this time.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Apr 10, 2025

with rewritten commits from this PR, while not maintaining the commit history and thus my authorship.

That's not a rewritten, I write everyline from scratch. Hopefully you can understand. So there is no commit history.

Again, as I said, your PR (this PR) DIDN'T include any authorship of me, EITHER. So why complain? You just added (force-pushed) some new commits later after you complained.

@wxiaoguang
Copy link
Contributor

And, one more thing, I just found this your new commit: if the client requests HTML, render an HTML 503 Service Unavailable page : 5078f67 , it is almost my "code" to render the template from another PR: if you complain I copied your code, to be honest : have your new commit included my authorship? Could I say or complain that you copied my code?

These are more appropriate for slower hardware
@bohde
Copy link
Contributor Author

bohde commented Apr 11, 2025

Asking @go-gitea/maintainers to look at this. I think it has addressed all feedback, and other potential proposals do not address the same use cases as this. This is a general purpose load shedder that will help prevent a server from crashing in any overload scenario, that prioritizes traffic from logged users and deprioritizes traffic that is likely from a crawler. While it certainly helps a server that experiences significant crawler traffic, it also protects servers in a general denial of service cases, either from heavy usage from logged in users or targeted attacks.

I've been running versions of this a production public server for a couple months now, and it has significantly improved the experience for logged in users.

The conversation has gotten quite long at this point, so I'll do my best to summarize the feedback and the status of it:

Copy link
Member

@denyskon denyskon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read through this PR and the conversation, and I think this feature is more than complete enough for the goal it's trying to achieve. Further refinements could be done later, but it's crucial that we can ship this in 1.24, as many servers, including gitea.com, experience problems with high load recently.

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Apr 11, 2025
@bohde
Copy link
Contributor Author

bohde commented Apr 12, 2025

Documentation PR is at https://gitea.com/gitea/docs/pulls/208, and the app.example.ini is updated to match in 6fb0021

@delvh delvh added reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. and removed docs-update-needed The document needs to be updated synchronously labels Apr 13, 2025
@github-actions github-actions bot added the docs-update-needed The document needs to be updated synchronously label Apr 13, 2025
@delvh
Copy link
Member

delvh commented Apr 13, 2025

Since this PR seems to have some discussion but is ready otherwise, let's wait a day to see if anything changes.
If there's none, I'll merge this PR tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-update-needed The document needs to be updated synchronously lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/dependencies modifies/go Pull requests that update Go code modifies/templates This PR modifies the template files modifies/translation reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. type/feature Completely new functionality. Can only be merged if feature freeze is not active.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants