Normalize concurrently with sync flows #893

Merged
merged 9 commits into from
Jan 25, 2024

Conversation

serprex
Contributor

@serprex serprex commented Dec 24, 2023

Previously, after each sync we'd pause reading the slot to process table schema deltas & normalize.
This has two problems:

  1. we want to always be reading the slot, but we aren't reading it during normalize
  2. merging multiple batches at once can be less expensive

Now NormalizeFlow is created as a child workflow at the start of the cdc flow, & a signal with schema updates is sent to it after each sync flow. Normalize consumes all signals received since it last checked & merges their processing, running in parallel with sync flows.
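A minimal sketch of the "consume all signals since it last checked" step, using a plain Go channel in place of the workflow signal channel. The `NormalizeSignal` type, its fields, and `drainSignals` are hypothetical names for illustration only, not the code in this PR:

```go
package main

import "fmt"

// NormalizeSignal is a hypothetical payload sent after each sync flow.
type NormalizeSignal struct {
	SyncBatchID  int64
	SchemaDeltas []string // stand-in for table schema deltas
}

// drainSignals collects every signal already queued, without blocking,
// and merges them into one unit of work: the highest batch id wins and
// schema deltas are concatenated in arrival order.
func drainSignals(ch <-chan NormalizeSignal) (merged NormalizeSignal, n int) {
	for {
		select {
		case sig, ok := <-ch:
			if !ok {
				return merged, n // channel closed
			}
			if sig.SyncBatchID > merged.SyncBatchID {
				merged.SyncBatchID = sig.SyncBatchID
			}
			merged.SchemaDeltas = append(merged.SchemaDeltas, sig.SchemaDeltas...)
			n++
		default:
			return merged, n // nothing else pending right now
		}
	}
}

func main() {
	signals := make(chan NormalizeSignal, 8)
	// Three sync flows finished while normalize was busy.
	signals <- NormalizeSignal{SyncBatchID: 1}
	signals <- NormalizeSignal{SyncBatchID: 2, SchemaDeltas: []string{"add column a"}}
	signals <- NormalizeSignal{SyncBatchID: 3}

	merged, n := drainSignals(signals)
	fmt.Println("signals merged:", n, "normalize up to batch:", merged.SyncBatchID)
}
```

Draining without blocking is what lets one normalize pass cover several sync batches when normalize falls behind.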

NormalizeFlow only reads up to the signal's batch id to avoid potentially syncing a batch without its schema. This creates a range (normid..syncid] in which normid is always catching up to syncid as we normalize batches normid+1 through syncid. The normalize logic already handled this, so it is untouched in this change.
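To make the (normid..syncid] range concrete, here is a small sketch that lists the batch ids one normalize pass would cover. `pendingBatches` is a hypothetical helper, not a function from this PR:

```go
package main

import "fmt"

// pendingBatches returns the batch ids in the half-open range
// (normID, syncID]: everything synced but not yet normalized.
func pendingBatches(normID, syncID int64) []int64 {
	var ids []int64
	for id := normID + 1; id <= syncID; id++ {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	fmt.Println(pendingBatches(3, 6)) // [4 5 6]
	fmt.Println(pendingBatches(6, 6)) // [] (normalize has caught up)
}
```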

PEERDB_ENABLE_PARALLEL_SYNC_NORMALIZE needs to be set to true to enable this. For now, keep this change behind a feature flag to avoid potentially increasing data warehouse costs.
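A sketch of how such a flag could be read, defaulting to off when unset or unparsable. Only the variable name comes from this PR; the helper `parallelSyncNormalizeEnabled` and the use of `strconv.ParseBool` are assumptions, and the actual parsing in the codebase may differ:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const flagName = "PEERDB_ENABLE_PARALLEL_SYNC_NORMALIZE"

// parallelSyncNormalizeEnabled reports whether the flag is set to a true
// value. The lookup function is injected so it can be exercised without
// touching the real environment; pass os.LookupEnv in production.
func parallelSyncNormalizeEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup(flagName)
	if !ok {
		return false // unset: keep the old sequential behavior
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	if parallelSyncNormalizeEnabled(os.LookupEnv) {
		fmt.Println("normalize runs concurrently with sync flows")
	} else {
		fmt.Println("normalize runs after each sync flow")
	}
}
```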

@serprex serprex changed the title Normalize split Normalize concurrently with sync flows Dec 24, 2023
@serprex serprex requested a review from heavycrystal December 27, 2023 18:01
@serprex serprex force-pushed the normalize-split branch 5 times, most recently from f05a857 to e1976c4 Compare January 3, 2024 00:34
This was referenced Jan 3, 2024
@serprex serprex force-pushed the normalize-split branch 4 times, most recently from 1c351fa to dd40b2a Compare January 12, 2024 22:50
@serprex serprex force-pushed the normalize-split branch 4 times, most recently from 06613bd to 47ca6ac Compare January 13, 2024 21:45
@serprex serprex marked this pull request as ready for review January 13, 2024 22:15
serprex added a commit that referenced this pull request Jan 15, 2024
serprex added a commit that referenced this pull request Jan 15, 2024
A sync batch should not be considered complete until its schema changes are processed.
This avoids failures after commit causing schema changes to be dropped,
which, when decoupling normalize/sync in #893, was causing normalization to miss values
@serprex serprex force-pushed the normalize-split branch 10 times, most recently from 179b4ca to f2a5b7f Compare January 19, 2024 22:33
@serprex serprex merged commit 8adab3f into main Jan 25, 2024
7 checks passed
@serprex serprex deleted the normalize-split branch January 25, 2024 16:46
3 participants