Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sdk): Introduce event_cache::Deduplicator #4174

Merged
merged 3 commits into from
Oct 30, 2024

Conversation

Hywan
Copy link
Member

@Hywan Hywan commented Oct 28, 2024

This patch is based on #4173 (it must be merged before this one).

This patch should be reviewed commit-by-commit. Only the last two commits are new (the others are part of #4173):

  1. First commit aims at implementing Deduplicator, an efficient type to detect duplicated event. It uses a bloom filter (one we had from our dependencies already), and decorates all events with a Decoration::Unique if the event is unique, Decoration::Duplicated if it's duplicated, or Decoration::Invalid if it's invalid (if it has no event_id).
  2. Second commit uses Deduplicator and LinkedChunk::remove_item_at (from feat(sdk): Add LinkedChunk::remove_item_at #4173!) to remove duplicated items from the event cache/the linked chunk.

In the presence of two duplicated events, the older one is removed. It mimics the current behaviour from the Timeline API.


@Hywan Hywan requested a review from a team as a code owner October 28, 2024 13:05
@Hywan Hywan requested review from bnjbvr and removed request for a team October 28, 2024 13:05
@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch from 8c41d63 to f621c8c Compare October 28, 2024 13:15
@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch from f621c8c to 8815d76 Compare October 28, 2024 13:19
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 81.69014% with 13 lines in your changes missing coverage. Please review.

Project coverage is 84.83%. Comparing base (ce9dc73) to head (cf4ea95).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
crates/matrix-sdk/src/event_cache/room/events.rs 79.59% 10 Missing ⚠️
crates/matrix-sdk/src/event_cache/deduplicator.rs 85.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4174      +/-   ##
==========================================
- Coverage   84.85%   84.83%   -0.03%     
==========================================
  Files         270      271       +1     
  Lines       29031    29097      +66     
==========================================
+ Hits        24634    24684      +50     
- Misses       4397     4413      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch from 8815d76 to a453080 Compare October 28, 2024 16:18
@Hywan
Copy link
Member Author

Hywan commented Oct 28, 2024

Rebased on main since #4173 is now merged.

Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, thanks! Not sure if we need the Bloom filter though, would like to see some data backing that claim :)

crates/matrix-sdk/src/event_cache/deduplicator.rs Outdated Show resolved Hide resolved
pub fn new() -> Self {
Self {
bloom_filter: Mutex::new(
GrowableBloomBuilder::new()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be the knight who says YAGNI, but… is it so much faster, considering we do have a btreeset check in both branches after the check_and_set below (equivalent to inserting into the btreeset)? Do you have data supporting this (a benchmark)?

Copy link
Member Author

@Hywan Hywan Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BTreeSet is used only to check the duplicates in the new events that we are checking. The bloom filter is to check the duplicates in past events. For example:

  • The bloom filter contains []. I check ["$ev0", "$ev1", "$ev2]. No duplicates.
  • The bloom filter now contains ["$ev0", "$ev1", "$ev2"]. I check ["$ev3"]. No duplicates.
  • The bloom filter now contains ["$ev0", "$ev1", "$ev2", "$ev3"]. I check ["$ev0"]. There is a duplicate.
  • The bloom filter now contains ["$ev0", "$ev1", "$ev2", "$ev3"]. I check ["$ev4", "$ev4"]. There is a duplicate but in the new events! That's detected by the use of BTreeSet.
  • The bloom filter now contains ["$ev0", "$ev1", "$ev2", "$ev3", "$ev4"].

This BTreeSet is necessary to reliably detect duplicates in the new scanned events.

You're right the BTreeSet is hit in the 2 paths inside check_and_set but the bloom filter is still more memory-space efficient than if we were using BTreeSet for new (currently scanned) and past (already scanned) events. That's the only reason I use a bloom filter here: to save memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thanks. How much memory are we saving, say, for 1K events?

Copy link
Member Author

@Hywan Hywan Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 scenarios: (i) BTreeSet to track all events, (ii) Bloom filter to track all events (as this PR does):

number of events size of BTreeSet (bytes) size of bloom filter (bytes)
100 1640 80
1000 16_040 80
10_000 160_040 80
100_000 1_600_040 80
1_000_000 16_000_040 88

The bloom filter is radically smaller, and almost constant in size!

The event ID is of formed $ev{nth}, that's what's stored in the BTreeSet or the bloom filter.

I had to clone growable-bloom-filters, change a couple of internal stuff, patch the SDK, but at least we have numbers!

Update: I've updated the comment to include 1_000_000 events, because the Deduplicator::APPROXIMATED_MAXIMUM_NUMBER_OF_EVENTS constant is set to 800_000. Just to see how it behaves.

crates/matrix-sdk/src/event_cache/deduplicator.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/linked_chunk/mod.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/store.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/store.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/store.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/deduplicator.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/deduplicator.rs Outdated Show resolved Hide resolved
@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch 2 times, most recently from 3fc3f0a to 6b43305 Compare October 29, 2024 13:59
crates/matrix-sdk/src/event_cache/room/events.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/room/events.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/room/events.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/event_cache/room/events.rs Outdated Show resolved Hide resolved

use super::*;
use crate::test_utils::events::EventFactory;

macro_rules! assert_events_eq {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much nicer with this macro 🤩

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is!

@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch from 6b43305 to 35faeea Compare October 29, 2024 15:14
This patch introduces `Deduplicator`, an efficient type to detect
duplicated events in the event cache. It uses a bloom filter, and
decorates a collection of events with `Decoration`, which an enum that
marks whether an event is unique, duplicated or invalid.
This patch uses the new `Deduplicator` type, along with
`LinkeChunk::remove_item_at` to remove duplicated events. When a new
event is received, the older one is removed.
@Hywan Hywan force-pushed the feat-sdk-event-cache-deduplicator-2 branch from 35faeea to cf4ea95 Compare October 30, 2024 14:13
@Hywan Hywan enabled auto-merge (rebase) October 30, 2024 14:17
@Hywan Hywan merged commit 71abbeb into matrix-org:main Oct 30, 2024
41 checks passed
@Hywan Hywan mentioned this pull request Nov 28, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants