Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZooKeeper Topo locks not being cleaned up properly in case of context timeout #15756

Closed
frouioui opened this issue Apr 18, 2024 · 0 comments · Fixed by #15757
Closed

ZooKeeper Topo locks not being cleaned up properly in case of context timeout #15756

frouioui opened this issue Apr 18, 2024 · 0 comments · Fixed by #15757

Comments

@frouioui
Copy link
Member

frouioui commented Apr 18, 2024

Overview of the Issue

@mattlord and I found an issue that arise every time we try to acquire a ZooKeeper Topo lock, we increment an ID by one and add it to a list. If that ID is the same as the first item on the list, we acquire the lock. If not, we look for the previous ID on the list matching our current ID, and wait for any update on that ID/Lock to proceed, or cancel the acquisition if the context has been canceled or has reached a timeout.

When we end up reaching a context timeout, we error out, and we remove the ID from the list so that new locks will look for still-existing IDs. However, the entire clean up phase depends on the same context variable as the one used to acquire the lock. Meaning that if the context was canceled, the clean up phase fails, and thus the ID is not removed making it impossible in the future to get any lock.

This bug was hidden by a bug in golang.org/x/sync which got fixed in golang.org/x/[email protected] (here). Basically, the semaphore package used to not check for canceled context when acquiring new locks, but now it is, meaning that the semaphore package was still proceeding although the context was done. We started running into this while trying to upgrade the x/sync dependency as part of #15743. After upgrading to the latest version, the issue is easily reproducible by running:

go test -v -count 1 -timeout 30s -run ^TestZk2Topo$ vitess.io/vitess/go/vt/topo/zk2topo

To highlight the issue here is a modified extract of the logs we get when running the test:

# We are adding 0000000000 to the list
[0000000000]

# We are adding 0000000001 to the list
[0000000001 0000000000]

# We are waiting for 0000000000 to be unlocked
watching:  /test-0/global/keyspaces/test_keyspace/locks/0000000000 exists:  true

# 0000000001 could not be acquired, context timeout
ctx done err

# Below the whole clean up phase of 0000000001 fails
W0418 17:10:05.914856   30904 lock.go:98] Failed to obtain action lock: context deadline exceeded
W0418 17:10:05.915651   30904 lock.go:101] Failed to close connection :context deadline exceeded
W0418 17:10:05.915659   30904 lock.go:108] Failed to get children of /test-0/global/keyspaces/test_keyspace/locks: context deadline exceeded

# We are adding 0000000002 to the list
[0000000001 0000000000 0000000002]

# We are waiting for 0000000001 to be unlocked, although it will never be unlocked
watching:  /test-0/global/keyspaces/test_keyspace/locks/0000000001 exists:  true

# 0000000002 could not be acquired, context timeout
ctx done err

# Below the whole clean up phase of 0000000002 fails
W0418 17:10:06.916236   30904 lock.go:98] Failed to obtain action lock: context canceled
W0418 17:10:06.916302   30904 lock.go:101] Failed to close connection :context canceled
W0418 17:10:06.916314   30904 lock.go:108] Failed to get children of /test-0/global/keyspaces/test_keyspace/locks: context canceled

Reproduction Steps

Upgrade to golang.org/x/[email protected] and run:

go test -v -count 1 -timeout 30s -run ^TestZk2Topo$ vitess.io/vitess/go/vt/topo/zk2topo

Binary Version

main + all supported release branches

Operating System and Environment details

any

Log Fragments

No response

@frouioui frouioui added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged Component: Topology and removed Needs Triage This issue needs to be correctly labelled and triaged labels Apr 18, 2024
@frouioui frouioui self-assigned this Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant