Kafka v2 changefeed fails to shut down #132198

Closed
asg0451 opened this issue Oct 8, 2024 · 3 comments · Fixed by #132761
Labels
A-cdc: Change Data Capture
branch-release-23.2: Used to mark GA and release blockers, technical advisories, and bugs for 23.2
branch-release-24.1: Used to mark GA and release blockers, technical advisories, and bugs for 24.1
branch-release-24.2: Used to mark GA and release blockers, technical advisories, and bugs for 24.2
branch-release-24.3: Used to mark GA and release blockers, technical advisories, and bugs for 24.3
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-testcluster: Issues found or occurred on a test cluster, i.e. a long-running internal cluster
T-cdc

Comments

asg0451 commented Oct 8, 2024

Changefeeds on the drt cluster were observed getting stuck on cancel. Investigation turned up a shutdown deadlock in franz-go, the library used by the Kafka v2 sink. See the conversation in Slack and the upstream issue.

The new sink is available in v23.2.10, v24.1.4, and v24.2.0. It is the default Kafka sink in v24.2.0+.

Jira issue: CRDB-42871

asg0451 added the C-bug, A-cdc, and T-cdc labels Oct 8, 2024

blathers-crl bot commented Oct 8, 2024

cc @cockroachdb/cdc


blathers-crl bot commented Oct 8, 2024

Hi @asg0451, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

asg0451 added the branch-release-24.1, branch-release-23.2, branch-release-24.2, and branch-release-24.3 labels Oct 8, 2024
BabuSrithar added the O-testcluster label Oct 9, 2024

twmb commented Oct 15, 2024

franz-go v1.18 released

asg0451 added a commit to asg0451/cockroach that referenced this issue Oct 16, 2024
Bump the version of franz-go to get a fix for an
issue where it could deadlock on client shutdown.

Fixes: cockroachdb#132198

Release note (bug fix): The franz-go library has
been updated to fix a potential deadlock on changefeed restarts.
craig bot pushed a commit that referenced this issue Oct 16, 2024
132703: raft: match MsgFortifyLeaderResp to MsgHeartbeatResp r=iskettaneh a=iskettaneh

When we receive a MsgHeartbeatResp, we attempt to send a MsgApp to the follower because it means that the follower is active.

This commit does the same when we receive a MsgFortifyLeaderResp. If we don't do this, we will need to wait until the next heartbeat timeout to attempt to send the MsgApp, which might increase the latency unnecessarily.

Epic: None

Release note: None

132761: changefeedccl: bump franz-go dependency to fix deadlock r=rharding6373 a=asg0451

Bump the version of franz-go to get a fix for an
issue where it could deadlock on client shutdown.

Fixes: #132198

Release note (bug fix): The franz-go library has
been updated to fix a potential deadlock on changefeed restarts.


132771: bincheck: bump `max_init_time` for `darwin-amd64` r=celiala a=rail

Previously, we started measuring the `init` times of the cockroach binary, but never ran them in GitHub Actions. GitHub Actions runners for darwin-amd64 look a bit slow. Last runs showed values around 1500-2000.

This PR bumps the `max_init_time` value to `3000`.

Epic: none
Release note: None

Co-authored-by: Ibrahim Kettaneh <[email protected]>
Co-authored-by: Miles Frankel <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
craig bot closed this as completed in 84d12ed Oct 16, 2024

3 participants