Resilient cache #279
Conversation
Force-pushed from 2c4aa81 to 08a3be2
TBH this is very hard to follow. I can only say that I am happy to test this whenever it is ready to test. |
Force-pushed from e4e0e34 to 12a72f4
I guess this is the important change here: cb98725. If you want to test this, you can use the limitador server built from this branch using:
This will require your Redis to always respond in at most 1ms. You can also shut the Redis down entirely... I use my limitador-driver to put load on Limitador during these tests and... well, you shouldn't notice any difference, and should always observe single-digit millisecond latency to Limitador from the driver. I've added some logging for when a network partition to Redis occurs and resolves:
|
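To make the approach above concrete, here is a hedged, self-contained sketch of the general technique (time-limit the remote call, fall back to cached counters). The names (`LocalCache`, `is_rate_limited`, `remote_hit`) are invented for illustration and the Redis round trip is simulated with a sleep, so this is not Limitador's actual implementation:

```rust
// Minimal sketch: ask the remote store, but never wait longer than a small
// response timeout; on timeout or error, answer from locally cached counters.
// All names are illustrative -- this is NOT Limitador code.
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;
use tokio::time::{sleep, timeout};

/// Locally cached counters used as the fallback authority.
struct LocalCache {
    counters: Mutex<HashMap<String, u64>>,
}

impl LocalCache {
    fn new() -> Self {
        Self { counters: Mutex::new(HashMap::new()) }
    }

    /// Increment the cached counter and report whether it exceeds the limit.
    fn hit(&self, key: &str, limit: u64) -> bool {
        let mut counters = self.counters.lock().unwrap();
        let value = counters.entry(key.to_string()).or_insert(0);
        *value += 1;
        *value > limit
    }
}

/// Stand-in for a Redis round trip; it just sleeps to simulate latency.
async fn remote_hit(_key: &str, _limit: u64, latency: Duration) -> Result<bool, ()> {
    sleep(latency).await;
    Ok(false)
}

/// The remote store is consulted first, but the caller never blocks past
/// `response_timeout`; on error or timeout the local cache decides.
async fn is_rate_limited(
    cache: &LocalCache,
    key: &str,
    limit: u64,
    response_timeout: Duration,
    simulated_latency: Duration,
) -> bool {
    match timeout(response_timeout, remote_hit(key, limit, simulated_latency)).await {
        Ok(Ok(limited)) => limited,                   // remote answered in time
        Ok(Err(_)) | Err(_) => cache.hit(key, limit), // error or timeout: use the cache
    }
}

#[tokio::main]
async fn main() {
    let cache = LocalCache::new();
    // A "partitioned" remote taking 50ms while the budget is 1ms, as in the
    // test described above: the decision comes from the local cache instead.
    let limited = is_rate_limited(
        &cache,
        "test_namespace/POST",
        5,
        Duration::from_millis(1),
        Duration::from_millis(50),
    )
    .await;
    println!("limited = {limited}");
}
```

With a 1ms budget, a Redis that is slow or gone entirely never adds latency to the caller; decisions are simply served from the cache until connectivity returns.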
@eguzki if you have that integration test for |
2,
100,
6,
Maybe it could be more informative if we extracted these into const default values (?)
They are... they actually are inlined from the redis crate, because I can't reuse theirs... they're not public.
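For illustration, a hedged sketch of the extraction the reviewer is suggesting. The constant names, and the assumed meaning of the three literals as reconnect backoff parameters copied from the redis crate (whose own defaults are not public), are assumptions, not code from this PR:

```rust
// Hypothetical constant names; the values mirror the inlined literals in the
// hunk above. They are assumed to be the redis crate's reconnect backoff
// parameters, re-declared by value because the crate's defaults are not public.
const DEFAULT_EXPONENT_BASE: u64 = 2;
const DEFAULT_RETRY_FACTOR_MS: u64 = 100;
const DEFAULT_NUMBER_OF_RETRIES: usize = 6;

fn main() {
    println!(
        "reconnect backoff: base={DEFAULT_EXPONENT_BASE}, \
         factor={DEFAULT_RETRY_FACTOR_MS}ms, retries={DEFAULT_NUMBER_OF_RETRIES}"
    );
}
```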
Looking good codewise. Before giving my 👍 , I want to run some tests. |
Unfortunately it did not pass.
failures:
---- test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache stdout ----
thread 'test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache' panicked at limitador/tests/integration_tests.rs:698:13:
assertion failed: !result.limited
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- test::check_rate_limited_and_update_with_async_redis_and_local_cache stdout ----
thread 'test::check_rate_limited_and_update_with_async_redis_and_local_cache' panicked at limitador/tests/integration_tests.rs:657:13:
assertion failed: !rate_limiter.check_rate_limited_and_update(namespace, &values, 1,
false).await.unwrap().limited
failures:
test::check_rate_limited_and_update_load_counters_with_async_redis_and_local_cache
test::check_rate_limited_and_update_with_async_redis_and_local_cache
test result: FAILED. 196 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.39s
error: test failed, to rerun pass `-p limitador --test integration_tests`
I will be looking into this as well |
That's probably that math I had missed... |
Force-pushed from 1d92ae1 to 04116a0
Before running partitioning tests, I ran a simple test of sending 1 request per second against a counter with a limit of max 5 requests every 10 seconds. The results are not the expected ones.
Test
cd limitador-server/sandbox/
make build
The applied configuration is:
diff --git a/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml b/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
index 6db3a4d..eb56293 100644
--- a/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
+++ b/limitador-server/sandbox/docker-compose-limitador-redis-cached.yaml
@@ -21,11 +21,11 @@ services:
       - /opt/kuadrant/limits/limits.yaml
       - redis_cached
       - --ttl
-      - "3"
+      - "30000"
       - --ratio
-      - "11"
+      - "1"
       - --flush-period
-      - "2"
+      - "100"
       - --max-cached
       - "5000"
       - redis://redis:6379
diff --git a/limitador-server/sandbox/limits.yaml b/limitador-server/sandbox/limits.yaml
index 0f7b90b..4d553b0 100644
--- a/limitador-server/sandbox/limits.yaml
+++ b/limitador-server/sandbox/limits.yaml
@@ -1,13 +1,13 @@
 ---
 - namespace: test_namespace
-  max_value: 10
-  seconds: 60
+  max_value: 5
+  seconds: 10
   conditions:
   - "req.method == 'GET'"
   variables: []
 - namespace: test_namespace
   max_value: 5
-  seconds: 60
+  seconds: 10
   conditions:
   - "req.method == 'POST'"
   variables: []
make deploy-redis
Limitador reports the following configuration as expected:
make grpcurl
while :; do bin/grpcurl -plaintext -d @ 127.0.0.1:18081 envoy.service.ratelimit.v3.RateLimitService.ShouldRateLimit <<EOM; sleep 1; done
{
"domain": "test_namespace",
"hits_addend": 1,
"descriptors": [
{
"entries": [
{
"key": "req.method",
"value": "POST"
}
]
}
]
}
EOM
I would expect, every 10 seconds, 5 requests to be accepted and 5 requests to be rejected. Instead I see this:
It could be due to the missing store increments I fixed in #283, specifically https://github.com/Kuadrant/limitador/pull/283/files#diff-d91ad49639aa408abab791b66f7d3bca5ea05d4ffff21d6afd7f5034b378774aR159 |
Force-pushed from 9462288 to f52b328
Force-pushed from 7e0e86b to 0ef11c6
@eguzki take it for another round, this should be good. |
Now looking good.
- namespace: test_namespace
  max_value: 5
  seconds: 10
  conditions:
  - "req.method == 'POST'"
  variables: []
The result is as expected: every 10 secs, 5 accepted and 5 rejected.
And integration tests for the redis cached datastore passing 🆗 |
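As an aside, the expected pattern can be reproduced with a tiny self-contained simulation (plain Rust, not Limitador code): a fixed 10-second window with max_value = 5, hit once per second, accepts the first 5 hits of every window and rejects the next 5.

```rust
// Simulate 1 request per second against a fixed 10-second window, max 5 hits.
fn main() {
    let max_value = 5u64;
    let window_secs = 10u64;
    let mut hits_in_window = 0u64;

    for second in 0..30u64 {
        if second % window_secs == 0 {
            hits_in_window = 0; // the window rolls over, counter resets
        }
        hits_in_window += 1;
        let limited = hits_in_window > max_value;
        println!("t={second:>2}s limited={limited}");
    }
}
```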
I have run the same test multiple times:
What I see is that when redis is down, limitador starts returning "always". In one of the tests (I could not reproduce it), limitador entered a "poison" state and it was returning "always"
I do not know what happens when redis is back, because I screwed up the environment when trying to get redis back on, and I could not reproduce the issue. |
Force-pushed from 0ef11c6 to 1afbaf8
Damn... seconds aren't milliseconds! 🤦 see e871dbe (tho I'm planning on 🔥 all that code and having it all properly encapsulated in a
😉 |
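A hedged illustration of the class of bug being owned up to here, and the kind of encapsulation that prevents it (assuming raw integers were being passed around; this is not the code from e871dbe):

```rust
use std::time::Duration;

fn main() {
    // Bare integers invite exactly this mistake: a value meant as seconds
    // ends up where milliseconds are expected, or vice versa.
    let ttl_seconds: u64 = 3;
    let wrong = Duration::from_millis(ttl_seconds); // 3ms, not 3s!
    let right = Duration::from_secs(ttl_seconds);

    // Carrying `Duration` values instead of raw integers makes the unit part
    // of the type, so the mix-up cannot happen silently.
    assert_ne!(wrong, right);
    println!("wrong = {wrong:?}, right = {right:?}");
}
```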
…the limit's TTL for now
also, on the poisoning, it means something went wrong while holding a lock… e.g. an |
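For context on the "poison" state mentioned above: in Rust, a `std::sync::Mutex` becomes poisoned when a thread panics while holding the lock, and every subsequent `lock()` then returns a `PoisonError`; a caller that just `unwrap()`s it will keep failing. A minimal, generic illustration (standard-library behaviour, not Limitador-specific code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0u64));

    // A thread that panics while holding the guard poisons the mutex.
    let poisoner = Arc::clone(&counter);
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("something went wrong while holding the lock");
    })
    .join();

    // Every later lock() now returns Err(PoisonError).
    match counter.lock() {
        Ok(_) => println!("lock acquired normally"),
        Err(poisoned) => {
            // One way to recover: take the inner guard anyway and keep serving.
            let guard = poisoned.into_inner();
            println!("mutex was poisoned, recovered value = {}", *guard);
        }
    }
}
```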
Working like a charm.
When redis is down, requests keep being rate limited according to the configured limits. And when redis is respawned, after some "over rejecting" period (rejecting more requests than expected), the behavior is consistent with the limits. The "over rejecting" must be due to limitador flushing cached counters from past-period hits.
builds on #276
fixes #273