[breaking] BQ SyncRecords now streams properly, code cleanup #909

Merged Dec 27, 2023 (1 commit)

@heavycrystal heavycrystal commented Dec 27, 2023

⚠️ This change can break existing CDC mirrors from Postgres to BigQuery!

Fixes a bug in BigQuery CDC where a batch with more than 2 ** 20 records causes the Avro file generation step to hang. All records were being written to a bounded channel first and only then consumed by the Avro writer, instead of the two operations running in parallel. With a large number of records, the channel would fill up and the write would block before all records had been written, so the loop deadlocked itself.

Fixed by switching BigQuery record generation to the mechanism used by Snowflake, where record generation happens in a separate goroutine, so the channel is consumed in parallel with being filled. As part of this change, some code was cleaned up and the BigQuery raw table schema was changed in a breaking manner to match the Snowflake/Postgres equivalent. Specifically, the column _peerdb_timestamp of type TIMESTAMP was removed, and the column _peerdb_timestamp_nanos of type INTEGER was renamed to _peerdb_timestamp. Existing raw tables will need to be altered to match this new, simpler schema:

ALTER TABLE <...> DROP COLUMN _peerdb_timestamp;
ALTER TABLE <...> RENAME COLUMN _peerdb_timestamp_nanos TO _peerdb_timestamp;

Closes #908

@Amogh-Bharadwaj Amogh-Bharadwaj left a comment

lgtm

@iskakaushik iskakaushik merged commit a3b2800 into main Dec 27, 2023
@serprex serprex deleted the bq-avro-streaming-fixes branch July 19, 2024 15:21
Successfully merging this pull request may close these issues.

BigQuery: SyncRecords - Have a parallel goroutine like in Snowflake