
feat: rate limiting #600

Merged 8 commits into master on Apr 1, 2024

Conversation

@geekbrother (Contributor) commented Mar 29, 2024

Description

This PR applies the shared rate-limiting crate from utils-rs to the blockchain-api endpoints.
The combination of the IP address and the endpoint is used as the identification key for the token bucket.

The following configuration environment variables were added:

  • RPC_PROXY_STORAGE_RATE_LIMITING_CACHE_REDIS_ADDR_READ and RPC_PROXY_STORAGE_RATE_LIMITING_CACHE_REDIS_ADDR_WRITE: Redis endpoints for the L2 cache.
  • RPC_PROXY_RATE_LIMITING_MAX_TOKENS: token bucket size.
  • RPC_PROXY_RATE_LIMITING_REFILL_INTERVAL_SEC: token refill interval in seconds.
  • RPC_PROXY_RATE_LIMITING_REFILL_RATE: token refill rate per interval.

To enable rate limiting, all of the required variables must be set; otherwise, rate limiting is disabled and a warning message is logged at startup.
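The all-or-nothing enabling rule can be sketched as follows. This is a hypothetical, std-only sketch (the struct and function names are my own, not the PR's); `get` abstracts over `std::env::var` so the rule is easy to test:

```rust
// Hedged sketch: rate limiting is enabled only when every variable is
// present and parses; otherwise None is returned so the caller can log a
// warning and skip installing the middleware.
#[derive(Debug, PartialEq)]
pub struct RateLimitingConfig {
    pub redis_addr_read: String,
    pub redis_addr_write: String,
    pub max_tokens: u32,
    pub refill_interval_sec: u64,
    pub refill_rate: u32,
}

pub fn rate_limiting_config(get: impl Fn(&str) -> Option<String>) -> Option<RateLimitingConfig> {
    // The `?` operator makes any missing or unparsable variable disable the
    // whole feature at once.
    Some(RateLimitingConfig {
        redis_addr_read: get("RPC_PROXY_STORAGE_RATE_LIMITING_CACHE_REDIS_ADDR_READ")?,
        redis_addr_write: get("RPC_PROXY_STORAGE_RATE_LIMITING_CACHE_REDIS_ADDR_WRITE")?,
        max_tokens: get("RPC_PROXY_RATE_LIMITING_MAX_TOKENS")?.parse().ok()?,
        refill_interval_sec: get("RPC_PROXY_RATE_LIMITING_REFILL_INTERVAL_SEC")?.parse().ok()?,
        refill_rate: get("RPC_PROXY_RATE_LIMITING_REFILL_RATE")?.parse().ok()?,
    })
}
```

A caller would invoke this as `rate_limiting_config(|k| std::env::var(k).ok())`, logging the startup warning and running without rate limiting when it returns `None`.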

Rate-limited counter metrics were added in e8a5cc4 to Prometheus and a Grafana panel to track how many rate-limited entries the L1 in-memory cache holds, so we can see how many users we are rate limiting at any moment and over time.

Axum middleware was implemented and added to handlers in 3d0e989.

The default rate-limiting configuration was added to the Terraform variables in fd1e605:

  • Maximum tokens in bucket: 100
  • Refill rate: 2
  • Refill interval: 1 second
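With these defaults a client can burst 100 requests and then sustain 2 requests per second. A minimal token-bucket sketch of the defaults above (hypothetical and std-only; the real implementation lives in the shared utils-rs crate and uses Redis plus an in-memory cache). Time is passed in explicitly, in seconds, to keep the refill logic testable:

```rust
// Hedged sketch of a token bucket with the PR's default parameters.
pub struct TokenBucket {
    max_tokens: u32,           // bucket size (burst capacity)
    refill_rate: u32,          // tokens added per interval
    refill_interval_secs: u64, // refill interval in seconds
    tokens: u32,
    last_refill_secs: u64,
}

impl TokenBucket {
    pub fn new(max_tokens: u32, refill_rate: u32, refill_interval_secs: u64) -> Self {
        assert!(refill_interval_secs > 0, "interval must be positive");
        Self { max_tokens, refill_rate, refill_interval_secs, tokens: max_tokens, last_refill_secs: 0 }
    }

    /// Returns true if the request is allowed (a token was consumed).
    pub fn try_consume(&mut self, now_secs: u64) -> bool {
        // Credit whole elapsed intervals, capped at the bucket size.
        let intervals = now_secs.saturating_sub(self.last_refill_secs) / self.refill_interval_secs;
        if intervals > 0 {
            let refill = self.refill_rate.saturating_mul(intervals.min(u32::MAX as u64) as u32);
            self.tokens = self.tokens.saturating_add(refill).min(self.max_tokens);
            self.last_refill_secs = now_secs;
        }
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false
        }
    }
}
```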

How Has This Been Tested?

  • New integration tests in 9cafe6c:
    • Flood an endpoint with more requests than the maximum tokens and check that it IS rate-limited,
    • Flood another endpoint with fewer requests than the maximum tokens and check that it is NOT rate-limited.

Due Diligence

  • Breaking change
  • Requires a documentation update
  • Requires an e2e/integration test update

@geekbrother geekbrother self-assigned this Mar 29, 2024
@geekbrother geekbrother changed the base branch from master to feat/update_latest_utils_rs March 29, 2024 00:01
src/lib.rs Outdated
Comment on lines 84 to 86
const RATE_LIMIT_MAX_TOKENS: u32 = 300;
const RATE_LIMIT_INTERVAL: Duration = Duration::from_secs(1);
const RATE_LIMIT_REFILL_RATE: u32 = 100;
Member

Looks like this is 100 req/s per IP? Seems really high. What about 2 req/s with burst of 100?

Plus I think there should be a second rate limit for the project, e.g. 1k/s.

Suggested change:
-const RATE_LIMIT_MAX_TOKENS: u32 = 300;
-const RATE_LIMIT_REFILL_RATE: u32 = 100;
+const RATE_LIMIT_MAX_TOKENS: u32 = 100;
+const RATE_LIMIT_REFILL_RATE: u32 = 2;
 const RATE_LIMIT_INTERVAL: Duration = Duration::from_secs(1);

Contributor Author

Makes sense, it's probably too high.
I've moved the configuration to env variables in 09d6730, so we don't need to rebuild everything just to change it.
Let's start with a burst of 100 and 2/sec, and then I'll check the metric from e8a5cc4 to see how many users are affected.

Contributor Author

Plus I think there should be a second rate limit for the project, e.g. 1k/s.

Regarding project ID limiting, I think that should be a follow-up PR. Let's ship this incrementally.

.is_rate_limited(
path.as_str(),
&network::get_forwarded_ip(headers.clone())
.unwrap_or_else(|| connect_info.0.ip())
@chris13524 (Member) commented Mar 29, 2024

I think defaulting to the connecting IP is unsafe. If something breaks with the forwarding headers, we should not fall back to the connecting IP, because that's supposed to be the load balancer (and rate limiting the load balancer would cause major unavailability). We should at least log this as an error, not silently default. Possibly we should not use a default at all and instead disable the rate limit, or disable it in some other way.

In Notify Server, for analytics, if the forwarded header wasn't available I didn't look up the connecting IP; instead I recorded empty geolocation values and emitted an error!().

Contributor Author

Thanks for the comment, this is a good suggestion! If we skip the connecting IP whenever X-Forwarded is absent, we break local testing (since we don't have X-Forwarded headers in local tests).
My assumption is that if the load balancer doesn't report a forwarded IP, the LB is broken, and that will break a lot more than rate limiting. I totally agree that we should log an error when the forwarded IP is missing, so we can quickly spot the case where we have an LB but no forwarded IP.
I've added an error log for the case where we don't have a forwarded IP.
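The behavior under discussion can be sketched as follows (function name hypothetical; the real middleware works with Axum's header map and ConnectInfo): prefer the first address in X-Forwarded-For, and on a missing or unparsable header log an error and fall back to the connecting IP, which behind a load balancer is usually the LB itself.

```rust
use std::net::IpAddr;

// Hedged sketch of forwarded-IP extraction with a logged (not silent) fallback.
pub fn client_ip(forwarded_for: Option<&str>, connect_ip: IpAddr) -> IpAddr {
    forwarded_for
        // X-Forwarded-For is a comma-separated list; the first entry is the client.
        .and_then(|v| v.split(',').next())
        .and_then(|s| s.trim().parse::<IpAddr>().ok())
        .unwrap_or_else(|| {
            // In the real middleware this would be an error!() log.
            eprintln!("error: no forwarded IP; falling back to connecting IP {connect_ip}");
            connect_ip
        })
}
```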

Member

What's the issue with local testing? Can't you set an X-Forwarded header?

state.clone(),
headers.clone(),
connect_info,
path.clone(),
Member

I think using the HTTP request path as the key is a bit naive, and I would prefer an enum. For e.g. the proxy, there are a number of endpoints that resolve to the same handler. I'm not sure about history; we might want one rate limit counter for the whole history API, but not sure.

Contributor Author

Thanks for catching this!
The bucket key is a composite of the IP and the matched path. The idea is that a legitimate user shouldn't exceed our max tokens anyway, so we could even use IP-only rate limiting.
If there are malicious floods and we key on the exact handler path instead of the matched path, we give attackers an additional way to flood us: max tokens * each path * path parameters. The matched path was added to the key composition mainly for testing facilities, not to grant users max tokens on every path and parameter combination.
So from my point of view we should be OK with this, and no legitimate users should be affected.

Member

I think I assumed this was the matched path (which it was). Still, my suggestion was to not use the path as the rate-limit key and to prefer something more explicit. E.g. both the /v1 and /v1/ matched paths resolve to the proxy handler, and in the future we might want to group them. But it's no big deal right now; we can always refactor later if needed.

We should be aware, though, that clients can get 2x the rate limit we might expect for the proxy handler.
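The composite key and the /v1 vs /v1/ grouping concern above can be sketched like this (function name hypothetical; shown with trailing-slash normalization so both matched paths would share a single bucket rather than doubling the limit):

```rust
// Hedged sketch: composite rate-limit bucket key from IP and matched path.
pub fn bucket_key(ip: &str, matched_path: &str) -> String {
    // Normalize trailing slashes so "/v1" and "/v1/" map to the same bucket.
    let path = matched_path.trim_end_matches('/');
    let path = if path.is_empty() { "/" } else { path };
    format!("{ip}:{path}")
}
```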

let app = app.with_state(state_arc.clone());

info!("v{}", build_version);
info!("Running RPC Proxy on port {}", port);
info!("Running Blockchain-API server on port {}", port);
let addr: SocketAddr = format!("{host}:{port}")
.parse()
.expect("Invalid socket address");


Can this be handled as an error instead of expecting?

Contributor Author

Thanks for the comment! This is the initial bootstrap; if the server can't bind to the listening address, we must stop execution with a fatal error. So the expect is appropriate here.

let redis_pool = Arc::new(
deadpool_redis::Config::from_url(redis_addr)
.create_pool(Some(deadpool_redis::Runtime::Tokio1))
.expect("Failed to create redis pool for rate limiting"),


can this error be handled?

Contributor Author

Since this is bootstrapping and skipping shouldn't hurt, it's better to disable rate limiting and still start the server if the Redis address in the configuration is wrong. Updated in 7ffe453, thanks!
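The fail-open behavior described here can be sketched generically (function name hypothetical): if building the rate-limiting Redis pool fails at boot, log a warning and run without rate limiting instead of panicking. In the real code the `Result` would come from `deadpool_redis::Config::from_url(..).create_pool(..)`.

```rust
// Hedged sketch: convert a boot-time pool-creation Result into an Option,
// logging and disabling the feature on failure rather than panicking.
pub fn pool_or_disable<P, E: std::fmt::Display>(result: Result<P, E>) -> Option<P> {
    match result {
        Ok(pool) => Some(pool),
        Err(e) => {
            eprintln!("warning: failed to create rate-limiting Redis pool ({e}); rate limiting disabled");
            None
        }
    }
}
```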

.time_to_live(
interval
.to_std()
.expect("Failed to convert duration for rate limiting memory cache"),


can this error be handled?

Contributor Author

This is a conversion from chrono's Duration to std's, because Moka uses a different Duration type. If this fails, something is really wrong with the app, and we probably should expect here.

@nopestack left a comment

I noticed some expects/unwraps in what appear to be boot-time operations, which is not egregious, but I'd prefer those errors to be handled where it makes sense. All OK apart from that.

@geekbrother geekbrother merged commit e0a9fe2 into master Apr 1, 2024
16 checks passed
@geekbrother geekbrother deleted the feat/rate_limiting branch April 1, 2024 13:48