feat(sdk): Introduce event_cache::Deduplicator #4174
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #4174      +/-   ##
==========================================
- Coverage   84.85%   84.83%   -0.03%
==========================================
  Files         270      271        +1
  Lines       29031    29097       +66
==========================================
+ Hits        24634    24684       +50
- Misses       4397     4413       +16
```

☔ View full report in Codecov by Sentry.
Rebased on
Looks good overall, thanks! Not sure if we need the Bloom filter though, would like to see some data backing that claim :)
```rust
pub fn new() -> Self {
    Self {
        bloom_filter: Mutex::new(
            GrowableBloomBuilder::new()
```
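The quoted hunk is truncated; here is a hedged reconstruction of the whole constructor for context. The builder methods and the constant's value are assumptions informed by this thread (the constant is cited later as 800_000), not necessarily the PR's exact code:

```rust
use std::sync::Mutex;

use growable_bloom_filter::{GrowableBloom, GrowableBloomBuilder};

pub struct Deduplicator {
    bloom_filter: Mutex<GrowableBloom>,
}

impl Deduplicator {
    // Value taken from a later comment in this thread.
    const APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS: usize = 800_000;

    pub fn new() -> Self {
        Self {
            bloom_filter: Mutex::new(
                GrowableBloomBuilder::new()
                    // Size the filter for the expected number of events.
                    .estimated_insertions(Self::APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS)
                    .build(),
            ),
        }
    }
}
```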
Sorry to be the knight who says YAGNI, but… is it so much faster, considering we do have a `BTreeSet` check in both branches after the `check_and_set` below (equivalent to inserting into the `BTreeSet`)? Do you have data supporting this (a benchmark)?
The `BTreeSet` is used only to check for duplicates within the new events being scanned. The bloom filter is there to check for duplicates against past events. For example:

- The bloom filter contains `[]`. I check `["$ev0", "$ev1", "$ev2"]`. No duplicates.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2"]`. I check `["$ev3"]`. No duplicates.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3"]`. I check `["$ev0"]`. There is a duplicate.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3"]`. I check `["$ev4", "$ev4"]`. There is a duplicate, but in the new events! That's detected by the use of the `BTreeSet`.
- The bloom filter now contains `["$ev0", "$ev1", "$ev2", "$ev3", "$ev4"]`.

This `BTreeSet` is necessary to reliably detect duplicates in the newly scanned events.

You're right that the `BTreeSet` is hit in both paths inside `check_and_set`, but the bloom filter is still more memory-efficient than using a `BTreeSet` for both new (currently scanned) and past (already scanned) events. That's the only reason I use a bloom filter here: to save memory.
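To make that concrete, here is a condensed sketch of the logic, with hypothetical names and simplified control flow (the SDK's real `Deduplicator` differs in detail): the bloom filter persists across calls and answers "probably seen before", while a fresh `BTreeSet` per call reliably catches duplicates within the batch itself.

```rust
use std::collections::BTreeSet;

use growable_bloom_filter::GrowableBloom;

/// Scans a batch of event IDs; `true` in the result marks a duplicate.
/// (Sketch only; `scan_batch` is not the SDK's API.)
fn scan_batch(filter: &mut GrowableBloom, batch: &[String]) -> Vec<bool> {
    // Lives only for this batch: reliably detects duplicates *within* it.
    let mut seen_in_batch = BTreeSet::new();

    batch
        .iter()
        .map(|event_id| {
            // `check_and_set` inserts the ID and reports whether it was
            // (probably) already present, i.e. seen in a past batch.
            let maybe_past_duplicate = filter.check_and_set(event_id);
            // `insert` returns `false` if the ID was already in the set,
            // i.e. duplicated inside this very batch.
            let batch_duplicate = !seen_in_batch.insert(event_id.clone());

            maybe_past_duplicate || batch_duplicate
        })
        .collect()
}
```

The or-combination here is a simplification: as discussed above, the real code consults the `BTreeSet` in both branches after `check_and_set`, since a bloom-filter hit alone can't say whether the earlier occurrence was in a past batch or in the current one.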
makes sense, thanks. How much memory are we saving, say, for 1K events?
Two scenarios: (i) a `BTreeSet` to track all events, (ii) a bloom filter to track all events (as this PR does):

| number of events | size of `BTreeSet` (bytes) | size of bloom filter (bytes) |
|---|---|---|
| 100 | 1640 | 80 |
| 1000 | 16_040 | 80 |
| 10_000 | 160_040 | 80 |
| 100_000 | 1_600_040 | 80 |
| 1_000_000 | 16_000_040 | 88 |
The bloom filter is radically smaller, and almost constant in size!

The event IDs are of the form `$ev{nth}`; that's what is stored in the `BTreeSet` or the bloom filter.

I had to clone `growable-bloom-filters`, change a couple of internal things, and patch the SDK, but at least we have numbers!

Update: I've updated the comment to include 1_000_000 events, because the `Deduplicator::APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS` constant is set to 800_000. Just to see how it behaves.
```rust
use super::*;
use crate::test_utils::events::EventFactory;

macro_rules! assert_events_eq {
```
So much nicer with this macro 🤩
It is!
This patch introduces `Deduplicator`, an efficient type to detect duplicated events in the event cache. It uses a bloom filter, and decorates a collection of events with `Decoration`, which is an enum that marks whether an event is unique, duplicated or invalid.

This patch uses the new `Deduplicator` type, along with `LinkedChunk::remove_item_at`, to remove duplicated events. When a new event is received, the older one is removed.
This patch is based on #4173 (it must be merged before this one). This patch should be reviewed commit-by-commit. Only the last two commits are new (the others are part of #4173):

- `Deduplicator`, an efficient type to detect duplicated events. It uses a bloom filter (one we already had in our dependencies), and decorates all events with a `Decoration::Unique` if the event is unique, a `Decoration::Duplicated` if it's duplicated, or a `Decoration::Invalid` if it's invalid (i.e. it has no `event_id`).
- `Deduplicator` and `LinkedChunk::remove_item_at` (from feat(sdk): Add `LinkedChunk::remove_item_at` #4173!) are used to remove duplicated items from the event cache/the linked chunk. In the presence of two duplicated events, the older one is removed. It mimics the current behaviour of the `Timeline` API; a rough sketch of this flow follows below.

Related issue: EventCache storage #3280
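A self-contained sketch of that removal flow, with stand-in types: a `Vec` plays the role of the real `LinkedChunk`, plain strings stand in for SDK events, and none of the names below are the SDK's actual API.

```rust
type EventId = String;

/// Mirrors the `Decoration` idea from the PR (names simplified).
enum Decoration {
    Unique(EventId),
    Duplicated(EventId),
    /// The event has no ID, so it cannot be deduplicated.
    Invalid,
}

/// Applies decorated events to the cache. A `Vec` stands in for the
/// `LinkedChunk` of #4173.
fn apply(events: &mut Vec<EventId>, decorated: Vec<Decoration>) {
    for decoration in decorated {
        match decoration {
            Decoration::Unique(id) => events.push(id),
            Decoration::Duplicated(id) => {
                // Remove the older copy first (the role played by
                // `LinkedChunk::remove_item_at` in the PR), then append
                // the new event: the newest occurrence wins, as in the
                // `Timeline` API.
                if let Some(position) = events.iter().position(|e| *e == id) {
                    events.remove(position);
                }
                events.push(id);
            }
            // This sketch simply drops events without an ID.
            Decoration::Invalid => {}
        }
    }
}
```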