You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@mattlord and I found an issue that arise every time we try to acquire a ZooKeeper Topo lock, we increment an ID by one and add it to a list. If that ID is the same as the first item on the list, we acquire the lock. If not, we look for the previous ID on the list matching our current ID, and wait for any update on that ID/Lock to proceed, or cancel the acquisition if the context has been canceled or has reached a timeout.
When we end up reaching a context timeout, we error out, and we remove the ID from the list so that new locks will look for still-existing IDs. However, the entire clean up phase depends on the same context variable as the one used to acquire the lock. Meaning that if the context was canceled, the clean up phase fails, and thus the ID is not removed making it impossible in the future to get any lock.
This bug was hidden by a bug in golang.org/x/sync which got fixed in golang.org/x/[email protected] (here). Basically, the semaphore package used to not check for canceled context when acquiring new locks, but now it is, meaning that the semaphore package was still proceeding although the context was done. We started running into this while trying to upgrade the x/sync dependency as part of #15743. After upgrading to the latest version, the issue is easily reproducible by running:
go test -v -count 1 -timeout 30s -run ^TestZk2Topo$ vitess.io/vitess/go/vt/topo/zk2topo
To highlight the issue here is a modified extract of the logs we get when running the test:
# We are adding 0000000000 to the list
[0000000000]
# We are adding 0000000001 to the list
[0000000001 0000000000]
# We are waiting for 0000000000 to be unlocked
watching: /test-0/global/keyspaces/test_keyspace/locks/0000000000 exists: true
# 0000000001 could not be acquired, context timeout
ctx done err
# Below the whole clean up phase of 0000000001 fails
W0418 17:10:05.914856 30904 lock.go:98] Failed to obtain action lock: context deadline exceeded
W0418 17:10:05.915651 30904 lock.go:101] Failed to close connection :context deadline exceeded
W0418 17:10:05.915659 30904 lock.go:108] Failed to get children of /test-0/global/keyspaces/test_keyspace/locks: context deadline exceeded
# We are adding 0000000002 to the list
[0000000001 0000000000 0000000002]
# We are waiting for 0000000001 to be unlocked, although it will never be unlocked
watching: /test-0/global/keyspaces/test_keyspace/locks/0000000001 exists: true
# 0000000002 could not be acquired, context timeout
ctx done err
# Below the whole clean up phase of 0000000002 fails
W0418 17:10:06.916236 30904 lock.go:98] Failed to obtain action lock: context canceled
W0418 17:10:06.916302 30904 lock.go:101] Failed to close connection :context canceled
W0418 17:10:06.916314 30904 lock.go:108] Failed to get children of /test-0/global/keyspaces/test_keyspace/locks: context canceled
Overview of the Issue
@mattlord and I found an issue that arise every time we try to acquire a ZooKeeper Topo lock, we increment an ID by one and add it to a list. If that ID is the same as the first item on the list, we acquire the lock. If not, we look for the previous ID on the list matching our current ID, and wait for any update on that ID/Lock to proceed, or cancel the acquisition if the context has been canceled or has reached a timeout.
When we end up reaching a context timeout, we error out, and we remove the ID from the list so that new locks will look for still-existing IDs. However, the entire clean up phase depends on the same context variable as the one used to acquire the lock. Meaning that if the context was canceled, the clean up phase fails, and thus the ID is not removed making it impossible in the future to get any lock.
This bug was hidden by a bug in
golang.org/x/sync
which got fixed ingolang.org/x/[email protected]
(here). Basically, the semaphore package used to not check for canceled context when acquiring new locks, but now it is, meaning that the semaphore package was still proceeding although the context was done. We started running into this while trying to upgrade thex/sync
dependency as part of #15743. After upgrading to the latest version, the issue is easily reproducible by running:To highlight the issue here is a modified extract of the logs we get when running the test:
Reproduction Steps
Upgrade to
golang.org/x/[email protected]
and run:Binary Version
main + all supported release branches
Operating System and Environment details
any
Log Fragments
No response
The text was updated successfully, but these errors were encountered: