Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167
Conversation
Force-pushed from ef82ced to fea3640
Force-pushed from fea3640 to 1e7ed68
{{if .IsRepo}}{{template "repo/header" .}}{{end}}
<div class="ui container">
	{{/*TODO: this page could be improved*/}}
	Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
Should be translated.
@@ -780,6 +780,7 @@ LEVEL = Info
;; for example: block anonymous AI crawlers from accessing repo code pages.
;; The "expensive" mode is experimental and subject to change.
;REQUIRE_SIGNIN_VIEW = false
;OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS =
Don't think we need to document debug-only options, at least not like that. Maybe we should add a `[debug]` ini section?
As mentioned in the other PR, I prefer status code 429 over 503 for per-client rate-limited responses. Ideally including a
To answer all these comments: this PR is a demo for one of my proposed "anti-crawler actions" in #33951 (comment); that option is not "debug-purpose only". But there are arguments in #33951. God knows what would happen 🤷
I think this approach is very interesting, and it could complement #33951. One assumption it makes, however, is that if the client runs JavaScript, it is not a crawler and therefore does not need to be throttled. This doesn't protect in scenarios where an overload is caused by a page drawing high amounts of traffic from humans operating browsers, nor does it protect against bots using frameworks such as Playwright to crawl.
It's difficult to distinguish between humans and bots in this scenario, since they are using real browser user agents, but I did run a quick experiment on our production traffic. I deployed a piece of JavaScript that pings a beacon endpoint on the Gitea instance with the URL of the page it is on, which is then logged. I then looked through these logs for common endpoints I see crawlers target, and I was able to find traffic from data center IPs in the beacon logs. Zooming in on a particular IP address, it did appear to be crawling links in an iterative fashion. I think this demonstrates that there are bots out there that execute JavaScript when crawling Gitea instances.
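(For concreteness, a minimal sketch of the kind of beacon logging described above; the endpoint name and wiring here are hypothetical and are not part of this PR or of the actual experiment. The client side would report the current page URL, e.g. via `navigator.sendBeacon`.)

```go
// Hypothetical beacon endpoint: log the reporting IP and page URL so the logs
// can later be cross-referenced with endpoints that crawlers typically target.
package main

import (
	"io"
	"log"
	"net/http"
)

func beaconHandler(w http.ResponseWriter, r *http.Request) {
	// Read the page URL reported by the client-side script (bounded to 2 KiB).
	body, err := io.ReadAll(io.LimitReader(r.Body, 2048))
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	log.Printf("beacon: ip=%s page=%s", r.RemoteAddr, string(body))
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/-/beacon", beaconHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```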
I think that a slight shift in strategy to work with the QoS middleware would make this approach more resilient. It could change the algorithm from "if you execute JavaScript, you get unrestricted access" to "if you don't execute JavaScript, you are more likely to be restricted". This also has the benefit of continuing to protect the server in cases where the overload is due to humans generating traffic.
const tokenCookieName = "gitea_arlt" // gitea anonymous rate limit token
cookieToken, _ := req.Cookie(tokenCookieName)
if cookieToken != nil && cookieToken.Value != "" {
	token, exist := tokenCache.Get(cookieToken.Value)
issue: this doesn't work when running multiple Gitea processes, perhaps to load balance or to enable graceful restarts. Since this code is meant for instances that serve substantial public traffic, they are more likely to be running a multi-process setup.
suggestion: leverage a signed token instead, perhaps a JWT. This avoids needing some sort of external store for these tokens. However, it also prevents easy revocation of a token, so it should likely have an expiration as well. Because of this, we should also add the request's IP address to the token and check that the IP address in the token matches the request's, since large, malicious crawlers often distribute workloads over many IPs to get around IP-based rate limits.
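A minimal sketch of this suggestion, assuming a stdlib HMAC-signed value rather than a full JWT library; the package and function names here are illustrative only, not proposed code for this PR.

```go
// Self-contained signed token of the form "<ip>|<unix-expiry>|<signature>",
// so no shared store is needed across Gitea processes.
package token

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
	"time"
)

func sign(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// New issues a token bound to the client IP with a limited lifetime.
func New(secret []byte, clientIP string, ttl time.Duration) string {
	payload := fmt.Sprintf("%s|%d", clientIP, time.Now().Add(ttl).Unix())
	return payload + "|" + sign(secret, payload)
}

// Verify checks the signature, the expiry, and that the token was issued to the
// same client IP, to limit sharing across a distributed crawler fleet.
func Verify(secret []byte, tokenValue, clientIP string) bool {
	parts := strings.Split(tokenValue, "|")
	if len(parts) != 3 {
		return false
	}
	payload := parts[0] + "|" + parts[1]
	if !hmac.Equal([]byte(sign(secret, payload)), []byte(parts[2])) {
		return false
	}
	exp, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil || time.Now().Unix() > exp {
		return false
	}
	return parts[0] == clientIP
}
```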
There are far more problems with running Gitea in a cluster; see my comments in other issues (#13791).
For this particular problem: a cluster must have Redis.
<script>
	setTimeout(() => {
		document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
		window.location.reload();
issue: in the event of an overload where the client executes JavaScript, this can still cause an outage. This could be the result of something such as a repo being on the front page of a news aggregator, drawing a lot of human users, or one of the many crawler frameworks that use headless browsers to automate crawling.
suggestion: move this logic to the QoS middleware instead, so that it only activates when a low-priority request is rejected. In that case, change the priority of the retried request to default, which gives these clients a higher chance of getting through when non-JavaScript-executing clients are crawling the site, but still provides a good experience for logged-in users in the event of a JavaScript-executing crawler having all of its requests classified as default (see the sketch after this comment).
suggestion(non-blocking): to further prioritize users, we could leverage the existing captcha support (if the user has configured it) to further segment out bots from humans. However, this might be quite a large addition.
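A rough sketch of the cookie acting as a priority signal for the QoS middleware rather than a bypass; the priority names and classifier function are hypothetical, and only the `gitea_arlt` cookie name comes from this PR's diff.

```go
// Hypothetical classifier: the rate-limit cookie shifts anonymous traffic from
// low to default priority instead of exempting it from QoS entirely.
package qos

import "net/http"

type Priority int

const (
	PriorityLow Priority = iota
	PriorityDefault
	PriorityHigh
)

// ClassifyRequest decides the queue priority for a request. Signed-in users keep
// the highest priority; anonymous clients that presented the rate-limit cookie
// (i.e. executed the retry script at least once) are promoted to default, so
// they compete with other default traffic rather than bypassing the limits.
func ClassifyRequest(req *http.Request, isSignedIn bool) Priority {
	if isSignedIn {
		return PriorityHigh
	}
	if c, err := req.Cookie("gitea_arlt"); err == nil && c.Value != "" {
		return PriorityDefault
	}
	return PriorityLow
}
```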
Yes, I also designed that approach, but I didn't integrate the QoS because it's your authorship.
Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
<script>
	setTimeout(() => {
		document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
issue: the path on this cookie can cause problems when `setting.AppSubURL` is not the default value.
Yes, some details still need to be improved, including the i18n and cookie management.
ctxData := middleware.GetContextData(req.Context())

tokenKey, _ := util.CryptoRandomString(32)
retryAfterDuration := 1 * time.Second
issue: since this causes the client to retry, there is a possibility that a lot of clients are told to retry after a given time, and those retries all happen at approximately the same time, causing an outage.
suggestion: add some random jitter to this value so that clients don't all retry at the same time.
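A minimal sketch of the jitter suggestion, with an assumed two-second jitter window on top of the existing one-second delay (the window size is an assumption, not something from this PR).

```go
// Spread retries over a small window so that clients rejected at the same
// moment do not all come back at the same time.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func retryAfterWithJitter() time.Duration {
	base := 1 * time.Second
	// Up to two extra seconds of random jitter.
	jitter := time.Duration(rand.Int63n(int64(2 * time.Second)))
	return base + jitter
}

func main() {
	fmt.Println(retryAfterWithJitter())
}
```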
No need. Client timing is already random.
It is still rate-limited and could be fine-tuned, and it is still better than a pure 503 page.
That's also my plan, but I didn't integrate the QoS in this demo because it's your authorship.
To help public sites like gitea.com protect themselves from crawlers:
To debug (manually at the moment): set OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS=0
This is just a quick PR and some values are hard-coded; if this PR looks good, they could also become config options later or in a new PR.
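A rough sketch of the overall mechanism, for readers of this thread: count in-flight anonymous requests to expensive pages and serve the busy/retry page once a configurable threshold such as OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS is exceeded. The wrapper and helper names below are illustrative only, not this PR's actual code.

```go
// Illustrative middleware: reject anonymous requests to expensive pages while
// too many of them are already in flight.
package overload

import (
	"net/http"
	"sync/atomic"
)

var inflightAnonymous atomic.Int64

// LimitAnonymous wraps an expensive handler. isAnonymous and busyPage are
// placeholders for the real sign-in check and the busy/retry template.
func LimitAnonymous(maxInflight int64, isAnonymous func(*http.Request) bool, busyPage, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if !isAnonymous(req) {
			next.ServeHTTP(w, req)
			return
		}
		if n := inflightAnonymous.Add(1); n > maxInflight {
			inflightAnonymous.Add(-1)
			busyPage.ServeHTTP(w, req) // sets the cookie and asks the client to retry
			return
		}
		defer inflightAnonymous.Add(-1)
		next.ServeHTTP(w, req)
	})
}
```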