Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167
Conversation
Force-pushed from ef82ced to fea3640
Force-pushed from fea3640 to 1e7ed68
{{if .IsRepo}}{{template "repo/header" .}}{{end}}
<div class="ui container">
	{{/*TODO: this page could be improved*/}}
	Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
Should be translated.
@@ -780,6 +780,7 @@ LEVEL = Info
;; for example: block anonymous AI crawlers from accessing repo code pages.
;; The "expensive" mode is experimental and subject to change.
;REQUIRE_SIGNIN_VIEW = false
;OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS =
Don't think we need to document debug-only options, at least not like that. Maybe we should add a `[debug]` ini section?
As mentioned in the other PR, I prefer status code 429 over 503 for per-client rate-limited responses. Ideally including a
To answer all these comments: this PR is a demo for one of my proposed "anti-crawler actions" in #33951 (comment); that option is not "debug-purpose only". But there are arguments in #33951. God knows what would happen 🤷
I think this approach is very interesting, and it could complement #33951. One assumption it makes, however, is that if the client runs JavaScript, it is not a crawler and therefore does not need to be throttled. This doesn't protect in scenarios where an overload is caused by a page drawing high amounts of traffic from humans operating browsers, nor does it protect against bots using frameworks such as Playwright to crawl.
It's difficult to distinguish between humans and bots in this scenario, since they are using real browser user agents, but I did run a quick experiment on our production traffic. I deployed a piece of JavaScript that pings a beacon endpoint on the Gitea instance with the URL of the page it is on, which is then logged. I then looked through these logs for common endpoints I see crawlers target, and I was able to find traffic from data center IPs in the beacon logs. Zooming in on a particular IP address, it did appear to be crawling links in an iterative fashion. I think this demonstrates that there are bots out there that execute JavaScript when crawling Gitea instances.
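(For concreteness, a minimal sketch of the kind of beacon logging described above; the endpoint name and wiring here are hypothetical and are not part of this PR or of the actual experiment. The client side would report the current page URL, e.g. via `navigator.sendBeacon`.)

```go
// Hypothetical beacon endpoint: log the reporting IP and page URL so the logs
// can later be cross-referenced with endpoints that crawlers typically target.
package main

import (
	"io"
	"log"
	"net/http"
)

func beaconHandler(w http.ResponseWriter, r *http.Request) {
	// Read the page URL reported by the client-side script (bounded to 2 KiB).
	body, err := io.ReadAll(io.LimitReader(r.Body, 2048))
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	log.Printf("beacon: ip=%s page=%s", r.RemoteAddr, string(body))
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/-/beacon", beaconHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```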
I think that a slight shift in strategy to work with the QoS middleware would make this approach more resilient. It could change the algorithm from "if you execute JavaScript, you get unrestricted access" to "if you don't execute JavaScript, you are more likely to be restricted". This also has the benefit of continuing to protect the server in cases where the overload is due to humans generating traffic.
const tokenCookieName = "gitea_arlt" // gitea anonymous rate limit token
cookieToken, _ := req.Cookie(tokenCookieName)
if cookieToken != nil && cookieToken.Value != "" {
	token, exist := tokenCache.Get(cookieToken.Value)
issue: this doesn't work when running multiple Gitea processes, perhaps to load balance or to enable graceful restarts. Since this code is meant for instances that serve substantial public traffic, they are more likely to be running a multi-process setup.
suggestion: leverage a signed token instead, perhaps a JWT. This avoids needing some sort of external store for these tokens. However, it also prevents easy revocation of a token, so it should likely have an expiration as well. Because of this, we should also add the request's IP address to the token and check that the IP address in the token matches the request's, since large, malicious crawlers often distribute workloads over many IPs to get around IP-based rate limits.
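A minimal sketch of this suggestion, assuming a stdlib HMAC-signed value rather than a full JWT library; the package and function names here are illustrative only, not proposed code for this PR.

```go
// Self-contained signed token of the form "<ip>|<unix-expiry>|<signature>",
// so no shared store is needed across Gitea processes.
package token

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
	"time"
)

func sign(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// New issues a token bound to the client IP with a limited lifetime.
func New(secret []byte, clientIP string, ttl time.Duration) string {
	payload := fmt.Sprintf("%s|%d", clientIP, time.Now().Add(ttl).Unix())
	return payload + "|" + sign(secret, payload)
}

// Verify checks the signature, the expiry, and that the token was issued to the
// same client IP, to limit sharing across a distributed crawler fleet.
func Verify(secret []byte, tokenValue, clientIP string) bool {
	parts := strings.Split(tokenValue, "|")
	if len(parts) != 3 {
		return false
	}
	payload := parts[0] + "|" + parts[1]
	if !hmac.Equal([]byte(sign(secret, payload)), []byte(parts[2])) {
		return false
	}
	exp, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil || time.Now().Unix() > exp {
		return false
	}
	return parts[0] == clientIP
}
```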
There are far more problems with running Gitea in a cluster; see my comments in other issues (#13791).
For this particular problem: a cluster must have Redis.
<script>
	setTimeout(() => {
		document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
		window.location.reload();
issue: in the event of an overload where the client executes JavaScript, this can still cause an outage. This could be the result of something such as a repo being on the front page of a news aggregator, drawing a lot of human users, or one of the many crawler frameworks that use headless browsers to automate crawling.
suggestion: move this logic to the QoS middleware instead, so that it only activates when a low-priority request is rejected. In that case, change the priority of the retried request to default, which gives these clients a higher chance of getting through when non-JavaScript-executing clients are crawling the site, but still provides a good experience for logged-in users in the event of a JavaScript-executing crawler having all of its requests classified as default (see the sketch after this comment).
suggestion(non-blocking): to further prioritize users, we could leverage the existing captcha support (if the user has configured it) to further segment out bots from humans. However, this might be quite a large addition.
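A rough sketch of the cookie acting as a priority signal for the QoS middleware rather than a bypass; the priority names and classifier function are hypothetical, and only the `gitea_arlt` cookie name comes from this PR's diff.

```go
// Hypothetical classifier: the rate-limit cookie shifts anonymous traffic from
// low to default priority instead of exempting it from QoS entirely.
package qos

import "net/http"

type Priority int

const (
	PriorityLow Priority = iota
	PriorityDefault
	PriorityHigh
)

// ClassifyRequest decides the queue priority for a request. Signed-in users keep
// the highest priority; anonymous clients that presented the rate-limit cookie
// (i.e. executed the retry script at least once) are promoted to default, so
// they compete with other default traffic rather than bypassing the limits.
func ClassifyRequest(req *http.Request, isSignedIn bool) Priority {
	if isSignedIn {
		return PriorityHigh
	}
	if c, err := req.Cookie("gitea_arlt"); err == nil && c.Value != "" {
		return PriorityDefault
	}
	return PriorityLow
}
```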
Yes, I also designed that approach, but I didn't integrate the QoS because it's your authorship.
Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
<script>
	setTimeout(() => {
		document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
issue: the path on this cookie can cause problems when `setting.AppSubURL` is not the default value.
Yes, some details still need to be improved, including the i18n and cookie management.
ctxData := middleware.GetContextData(req.Context())

tokenKey, _ := util.CryptoRandomString(32)
retryAfterDuration := 1 * time.Second
issue: since this causes the client to retry, there is a possibility that a lot of clients are told to retry after a given time, and those retries all happen at approximately the same time, causing an outage.
suggestion: add some random jitter to this value so that clients don't all retry at the same time.
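A minimal sketch of the jitter suggestion, with an assumed two-second jitter window on top of the existing one-second delay (the window size is an assumption, not something from this PR).

```go
// Spread retries over a small window so that clients rejected at the same
// moment do not all come back at the same time.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func retryAfterWithJitter() time.Duration {
	base := 1 * time.Second
	// Up to two extra seconds of random jitter.
	jitter := time.Duration(rand.Int63n(int64(2 * time.Second)))
	return base + jitter
}

func main() {
	fmt.Println(retryAfterWithJitter())
}
```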
No need. Client timing is already random.
It is still rate-limited and could be fine-tuned, and it is still better than a pure 503 page.
That's also my plan, but I didn't integrate the QoS in this demo because it's your authorship.
To help public sites like gitea.com protect themselves from crawlers:
To debug (manually at the moment): set OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS=0
This is just a quick PR and some values are hard-coded; if this PR looks good, they could also become config options later or in a new PR.
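A rough sketch of the overall mechanism, for readers of this thread: count in-flight anonymous requests to expensive pages and serve the busy/retry page once a configurable threshold such as OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS is exceeded. The wrapper and helper names below are illustrative only, not this PR's actual code.

```go
// Illustrative middleware: reject anonymous requests to expensive pages while
// too many of them are already in flight.
package overload

import (
	"net/http"
	"sync/atomic"
)

var inflightAnonymous atomic.Int64

// LimitAnonymous wraps an expensive handler. isAnonymous and busyPage are
// placeholders for the real sign-in check and the busy/retry template.
func LimitAnonymous(maxInflight int64, isAnonymous func(*http.Request) bool, busyPage, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if !isAnonymous(req) {
			next.ServeHTTP(w, req)
			return
		}
		if n := inflightAnonymous.Add(1); n > maxInflight {
			inflightAnonymous.Add(-1)
			busyPage.ServeHTTP(w, req) // sets the cookie and asks the client to retry
			return
		}
		defer inflightAnonymous.Add(-1)
		next.ServeHTTP(w, req)
	})
}
```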