feat(sdk): Introduce event_cache::Deduplicator #4174
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #4174      +/-   ##
==========================================
- Coverage   84.85%   84.83%   -0.03%
==========================================
  Files         270      271        +1
  Lines       29031    29097       +66
==========================================
+ Hits        24634    24684       +50
- Misses       4397     4413       +16
```

☔ View full report in Codecov by Sentry.
Rebased on
Looks good overall, thanks! Not sure if we need the Bloom filter though, would like to see some data backing that claim :)
```rust
pub fn new() -> Self {
    Self {
        bloom_filter: Mutex::new(
            GrowableBloomBuilder::new()
```
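The quoted hunk is truncated; here is a hedged reconstruction of the whole constructor for context. The builder methods and the constant's value are assumptions informed by this thread (the constant is cited later as 800_000), not necessarily the PR's exact code:

```rust
use std::sync::Mutex;

use growable_bloom_filter::{GrowableBloom, GrowableBloomBuilder};

pub struct Deduplicator {
    bloom_filter: Mutex<GrowableBloom>,
}

impl Deduplicator {
    // Value taken from a later comment in this thread.
    const APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS: usize = 800_000;

    pub fn new() -> Self {
        Self {
            bloom_filter: Mutex::new(
                GrowableBloomBuilder::new()
                    // Size the filter for the expected number of events.
                    .estimated_insertions(Self::APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS)
                    .build(),
            ),
        }
    }
}
```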
Sorry to be the knight who says YAGNI, but… is it so much faster, considering we do have a `BTreeSet` check in both branches after the `check_and_set` below (equivalent to inserting into the `BTreeSet`)? Do you have data supporting this (a benchmark)?
The `BTreeSet` is used only to check for duplicates within the new events being scanned. The bloom filter is there to check for duplicates against past events. For example:

- The bloom filter contains `[]`. I check `["$ev0", "$ev1", "$ev2"]`. No duplicates.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2"]`. I check `["$ev3"]`. No duplicates.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3"]`. I check `["$ev0"]`. There is a duplicate.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3"]`. I check `["$ev4", "$ev4"]`. There is a duplicate, but in the new events! That's detected by the use of the `BTreeSet`.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3", "$ev4"]`.

This `BTreeSet` is necessary to reliably detect duplicates in the newly scanned events.

You're right that the `BTreeSet` is hit in both paths inside `check_and_set`, but the bloom filter is still more memory-efficient than using a `BTreeSet` for both new (currently scanned) and past (already scanned) events. That's the only reason I use a bloom filter here: to save memory.
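To make that concrete, here is a condensed sketch of the logic, with hypothetical names and simplified control flow (the SDK's real `Deduplicator` differs in detail): the bloom filter persists across calls and answers "probably seen before", while a fresh `BTreeSet` per call reliably catches duplicates within the batch itself.

```rust
use std::collections::BTreeSet;

use growable_bloom_filter::GrowableBloom;

/// Scans a batch of event IDs; `true` in the result marks a duplicate.
/// (Sketch only; `scan_batch` is not the SDK's API.)
fn scan_batch(filter: &mut GrowableBloom, batch: &[String]) -> Vec<bool> {
    // Lives only for this batch: reliably detects duplicates *within* it.
    let mut seen_in_batch = BTreeSet::new();

    batch
        .iter()
        .map(|event_id| {
            // `check_and_set` inserts the ID and reports whether it was
            // (probably) already present, i.e. seen in a past batch.
            let maybe_past_duplicate = filter.check_and_set(event_id);
            // `insert` returns `false` if the ID was already in the set,
            // i.e. duplicated inside this very batch.
            let batch_duplicate = !seen_in_batch.insert(event_id.clone());

            maybe_past_duplicate || batch_duplicate
        })
        .collect()
}
```

The or-combination here is a simplification: as discussed above, the real code consults the `BTreeSet` in both branches after `check_and_set`, since a bloom-filter hit alone can't say whether the earlier occurrence was in a past batch or in the current one.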
makes sense, thanks. How much memory are we saving, say, for 1K events?
Two scenarios: (i) a `BTreeSet` to track all events, (ii) a bloom filter to track all events (as this PR does):

| number of events | size of `BTreeSet` (bytes) | size of bloom filter (bytes) |
|---|---|---|
| 100 | 1640 | 80 |
| 1000 | 16_040 | 80 |
| 10_000 | 160_040 | 80 |
| 100_000 | 1_600_040 | 80 |
| 1_000_000 | 16_000_040 | 88 |
The bloom filter is radically smaller, and almost constant in size!

The event IDs are of the form `$ev{nth}`; that's what is stored in the `BTreeSet` or the bloom filter.

I had to clone `growable-bloom-filters`, change a couple of internal things, and patch the SDK, but at least we have numbers!

Update: I've updated the comment to include 1_000_000 events, because the `Deduplicator::APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS` constant is set to 800_000. Just to see how it behaves.
```rust
use super::*;
use crate::test_utils::events::EventFactory;

macro_rules! assert_events_eq {
```
So much nicer with this macro 🤩
It is!
This patch introduces `Deduplicator`, an efficient type to detect duplicated events in the event cache. It uses a bloom filter, and decorates a collection of events with `Decoration`, which is an enum that marks whether an event is unique, duplicated or invalid.

This patch uses the new `Deduplicator` type, along with `LinkedChunk::remove_item_at`, to remove duplicated events. When a new event is received, the older one is removed.
This patch is based on #4173 (it must be merged before this one). This patch should be reviewed commit-by-commit. Only the last two commits are new (the others are part of #4173):

- `Deduplicator`, an efficient type to detect duplicated events. It uses a bloom filter (one we already had in our dependencies), and decorates all events with a `Decoration::Unique` if the event is unique, a `Decoration::Duplicated` if it's duplicated, or a `Decoration::Invalid` if it's invalid (i.e. it has no `event_id`).
- `Deduplicator` and `LinkedChunk::remove_item_at` (from feat(sdk): Add `LinkedChunk::remove_item_at` #4173!) are used to remove duplicated items from the event cache/the linked chunk. In the presence of two duplicated events, the older one is removed. It mimics the current behaviour of the `Timeline` API; a rough sketch of this flow follows below.

Related issue: EventCache storage #3280
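A self-contained sketch of that removal flow, with stand-in types: a `Vec` plays the role of the real `LinkedChunk`, plain strings stand in for SDK events, and none of the names below are the SDK's actual API.

```rust
type EventId = String;

/// Mirrors the `Decoration` idea from the PR (names simplified).
enum Decoration {
    Unique(EventId),
    Duplicated(EventId),
    /// The event has no ID, so it cannot be deduplicated.
    Invalid,
}

/// Applies decorated events to the cache. A `Vec` stands in for the
/// `LinkedChunk` of #4173.
fn apply(events: &mut Vec<EventId>, decorated: Vec<Decoration>) {
    for decoration in decorated {
        match decoration {
            Decoration::Unique(id) => events.push(id),
            Decoration::Duplicated(id) => {
                // Remove the older copy first (the role played by
                // `LinkedChunk::remove_item_at` in the PR), then append
                // the new event: the newest occurrence wins, as in the
                // `Timeline` API.
                if let Some(position) = events.iter().position(|e| *e == id) {
                    events.remove(position);
                }
                events.push(id);
            }
            // This sketch simply drops events without an ID.
            Decoration::Invalid => {}
        }
    }
}
```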