Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resilient cache #279

Merged
merged 10 commits into from
Apr 3, 2024
Merged

Resilient cache #279

merged 10 commits into from
Apr 3, 2024

Conversation

alexsnaps
Copy link
Member

@alexsnaps alexsnaps commented Mar 22, 2024

builds on #276
fixes #273

@eguzki
Copy link
Contributor

eguzki commented Mar 25, 2024

TBH this is very hard to follow. I can only say that I am happy to test this whenever it is ready to test.

@alexsnaps alexsnaps force-pushed the resilient_cache branch 3 times, most recently from e4e0e34 to 12a72f4 Compare March 27, 2024 12:53
@alexsnaps
Copy link
Member Author

alexsnaps commented Mar 27, 2024

TBH this is very hard to follow. I can only say that I am happy to test this whenever it is ready to test.

I guess this is the important change here: cb98725
Which builds upon this change the StorageErr that now exposes transient errors as such... mostly the ones triggered because of a timeout awaiting a response from Redis, controlled thru this change: 2f5e6cd

If you want to test this, you can use the limitador server built from this branch using:

redis_cached redis://127.0.0.1:6379 --ratio 1 --response-timeout 1

This will require your redis to always respond in at most 1ms. You can also shut the redis down entirely... I use my limitador-driver to put load on Limitador during these tests and... well you shouldn't notice any difference and always observe single digit millisecond latency to Limitador from the driver.

I've added some logging on a network partition to redis occurring and resolving:

2024-03-27T13:38:11.013521Z ERROR should_rate_limit:check_and_update: limitador::storage::redis::redis_cached: Partition to Redis detected!
2024-03-27T13:38:11.790301Z  WARN limitador::storage::redis::redis_cached: Partition to Redis resolved!

@alexsnaps alexsnaps marked this pull request as ready for review March 27, 2024 13:02
@alexsnaps
Copy link
Member Author

@eguzki if you have that integration test for CachedRedisStorage, running it on top of this branch should make it pass (until proven wrong!)

Comment on lines +235 to +225
2,
100,
6,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be more informative if we extract this to default values in a const (?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are... they actually are inlined from the redis crate, cause I can't reuse theirs ... not public.

@eguzki
Copy link
Contributor

eguzki commented Mar 27, 2024

Looking good codewise.

Before giving my 👍 , I want to run some tests.

@eguzki
Copy link
Contributor

eguzki commented Mar 27, 2024

@eguzki if you have that integration test for CachedRedisStorage, running it on top of this branch should make it pass (until proven wrong!)

Unfortunately did not pass.

failures:

---- test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache stdout ----
thread 'test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache' panicked at limitador/tests/integration_tests.rs:698:13:
assertion failed: !result.limited
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test::check_rate_limited_and_update_with_async_redis_and_local_cache stdout ----
thread 'test::check_rate_limited_and_update_with_async_redis_and_local_cache' panicked at limitador/tests/integration_tests.rs:657:13:
assertion failed: !rate_limiter.check_rate_limited_and_update(namespace, &values, 1,
                    false).await.unwrap().limited


failures:
    test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache
    test::check_rate_limited_and_update_with_async_redis_and_local_cache

test result: FAILED. 196 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.39s

error: test failed, to rerun pass `-p limitador --test integration_tests`

I will be looking into this as well

@alexsnaps
Copy link
Member Author

Unfortunately did not pass.

That's probably that math I had missed...

@alexsnaps alexsnaps force-pushed the resilient_cache branch 2 times, most recently from 1d92ae1 to 04116a0 Compare March 28, 2024 16:52
@eguzki
Copy link
Contributor

eguzki commented Mar 28, 2024

Before running partitioning tests, I run simple test of running 1 request per second hitting a counter with a limit of max 5 req every 10 secs. And results are not the expected ones.

Test

  • Checkout this code branch and enter sandbox folder
cd limitador-server/sandbox/
  • Build limitador image limitador-testing for the current branch
make build
  • Run sandbox with redis

The applied configuration is:

  • max_ttl: 30secs
  • ratio: 1
  • flush period: 100ms
diff --git a/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml b/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
index 6db3a4d..eb56293 100644
--- a/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
+++ b/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
@@ -21,11 +21,11 @@ services:
       - /opt/kuadrant/limits/limits.yaml
       - redis_cached
       - --ttl
-      - "3"
+      - "30000"
       - --ratio
-      - "11"
+      - "1"
       - --flush-period
-      - "2"
+      - "100"
       - --max-cached
       - "5000"
       - redis://redis:6379
diff --git a/limitador-server/sandbox/limits.yaml b/limitador-server/sandbox/limits.yaml
index 0f7b90b..4d553b0 100644
--- a/limitador-server/sandbox/limits.yaml
+++ b/limitador-server/sandbox/limits.yaml
@@ -1,13 +1,13 @@
 ---
 - namespace: test_namespace
-  max_value: 10
-  seconds: 60
+  max_value: 5
+  seconds: 10
   conditions:
     - "req.method == 'GET'"
   variables: []
 - namespace: test_namespace
   max_value: 5
-  seconds: 60
+  seconds: 10
   conditions:
     - "req.method == 'POST'"
   variables: []
make deploy-redis

Limitador reports the following configuration as expected

limitador-1  | 2024-03-28T23:23:26.730508Z  INFO limitador_server: Using config: Configuration { limits_file: "/opt/kuadrant/limits/limits.yaml", storage: Redis(RedisStorageConfiguration { url: "redis://redis:6379", cache: Some(RedisStorageCacheConfiguration { flushing_period: 100, max_ttl: 30000, ttl_ratio: 1, max_counters: 5000, response_timeout: 350 }) }), rls_host: "0.0.0.0", rls_port: 8081, http_host: "0.0.0.0", http_port: 8080, limit_name_in_labels: false, tracing_endpoint: "", log_level: Some(LevelFilter::DEBUG), rate_limit_headers: None, grpc_reflection_service: true }
  • Open a new shell and run 1 request per second
make grpcurl
while :; do bin/grpcurl -plaintext -d @ 127.0.0.1:18081 envoy.service.ratelimit.v3.RateLimitService.ShouldRateLimit <<EOM; sleep 1; done
{
    "domain": "test_namespace",
    "hits_addend": 1,
    "descriptors": [
        {
            "entries": [
                {
                    "key": "req.method",
                    "value": "POST"
                }
            ]
        }
    ]
}
EOM

I would expect on every 10 seconds, 5 request being accepted and 5 requests being rejected. Instead I see this

{
  "overallCode": "OK"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}

It could be some missing store increment fixes I added in #283, specifically https://github.com/Kuadrant/limitador/pull/283/files#diff-d91ad49639aa408abab791b66f7d3bca5ea05d4ffff21d6afd7f5034b378774aR159

@alexsnaps alexsnaps force-pushed the resilient_cache branch 2 times, most recently from 9462288 to f52b328 Compare April 2, 2024 18:54
@alexsnaps alexsnaps force-pushed the resilient_cache branch 2 times, most recently from 7e0e86b to 0ef11c6 Compare April 2, 2024 19:21
@alexsnaps
Copy link
Member Author

@eguzki take it for another round, this should be good.
If so, lets review/merge this and iterate from there on adding tests & I'll refactor this TTL et al stuff into a CachedValue struct

@eguzki
Copy link
Contributor

eguzki commented Apr 2, 2024

Now looking good
With the limit

- namespace: test_namespace
  max_value: 5
  seconds: 10
  conditions:
    - "req.method == 'POST'"
  variables: []

The result is as expected, every 10 secs, 5 accepted and 5 rejected

while :; do bin/grpcurl -plaintext -d @ 127.0.0.1:18081 envoy.service.ratelimit.v3.RateLimitService.ShouldRateLimit <<EOM; sleep 1; done
{
    "domain": "test_namespace",
    "hits_addend": 1,
    "descriptors": [
        {
            "entries": [
                {
                    "key": "req.method",
                    "value": "POST"
                }
            ]
        }
    ]
}
EOM
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OVER_LIMIT"
}
{
  "overallCode": "OK"
}
{
  "overallCode": "OK"
}

And integration tests for the redis cached datastore passing 🆗

@eguzki
Copy link
Contributor

eguzki commented Apr 2, 2024

I have run the same test multiple times:

  • while sending 1req/sec, kill redis. See what happens and then bring redis back on.

What I see is that when redis is down, limitador starts returning "always" "overallCode": "OK" and limits are not respected. When redis is back on, the responses are again consistent with the limits set.

In one of the test (I could not reproduce), limitador entered "poison" state and it was returning "always"

ERROR:
  Code: Unavailable
  Message: Service unavailable

I do not know what happens when redis is back, because I screwed up the environment when trying to get redis back on, and I could not reproduce the issue.

@alexsnaps
Copy link
Member Author

alexsnaps commented Apr 2, 2024

Damn... seconds aren't milliseconds! 🤦 see e871dbe (tho I'm planning on 🔥 all that code and having it all properly encapsulated in a CachedCounterValue... that would deal with handling all these things: ttls, increments, and that's what the WriteBehind would use, so to discard expired increments - see below)
Should really be fixed now... BUT! You'd still see an "error" when resolving the partition, which is what we actually expected.
i.e. when it reconnects to the Redis, it writes all the increments that happened during the partition, regardless if they still should be or not... Again, that was always broken tho.
So I'd rather have it fixed as a follow up PR as someone already pointed out:

TBH this is very hard to follow.

😉

@alexsnaps
Copy link
Member Author

also, on the poisoning, it means something went wrong while holding a lock… e.g. an .unwrap on an Err that had the thread panic, while that thread held the mutex… See #280 , but I'll give it a pass in this code when refactoring the CachedCounterValue already

Copy link
Contributor

@eguzki eguzki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working like a charm.

when redis is down, requests keep being rate limited according with the configured limits. And when redis is respawned, after some "over rejecting" period (rejecting more requests than expected), the behavior is consistent with the limits. The "over rejecting" must be due to limitador flushing cached counters of past period hits.

@alexsnaps alexsnaps merged commit cce84e4 into main Apr 3, 2024
20 checks passed
@alexsnaps alexsnaps deleted the resilient_cache branch April 3, 2024 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Make CachedRedisStorage fault tolerant
3 participants