PWX-33631: Hold round-robin lock only when needed #2335
Conversation
Force-pushed from 25376fc to 8564bf1
Ran my longevity script w/ this change for about 15 minutes, re-creating and deleting 30 PVCs over and over.
@@ -85,6 +82,39 @@ func (rr *roundRobin) GetRemoteNodeConnection(ctx context.Context) (*grpc.Client
// Get target node info and set next round-robin node.
// nextNode is always (lastNode + 1) mod numOfNodes, to loop back to zero.
targetNodeEndpoint, isRemoteConn := rr.getTargetAndIncrement(&cluster)
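For context, a minimal sketch of what `getTargetAndIncrement` might look like with the lock narrowed this way. All type and field names below (`node`, `mu`, `nextNode`, `localNodeID`) are assumptions for illustration, not the PR's actual code:

```go
package main

import "sync"

// Sketch only: these types stand in for the real roundRobin and cluster
// structures; the field names are assumptions, not the PR's code.
type node struct {
	ID       string
	Endpoint string
}

type roundRobin struct {
	mu          sync.Mutex
	nextNode    int
	localNodeID string
}

// getTargetAndIncrement holds the lock only for the read-and-advance of the
// round-robin index; enumeration and dialing happen outside it.
func (rr *roundRobin) getTargetAndIncrement(nodes []node) (endpoint string, isRemote bool) {
	if len(nodes) == 0 {
		return "", false
	}

	rr.mu.Lock()
	if rr.nextNode >= len(nodes) { // the cluster may have shrunk since last call
		rr.nextNode = 0
	}
	idx := rr.nextNode
	// nextNode is always (lastNode + 1) mod numOfNodes, looping back to zero.
	rr.nextNode = (idx + 1) % len(nodes)
	rr.mu.Unlock()

	target := nodes[idx]
	return target.Endpoint, target.ID != rr.localNodeID
}
```

The point is that only the index read-and-advance sits inside the critical section; everything expensive happens with the lock released.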
Do you think we can remove the `rr.cleanupMissingNodeConnections()` call from line 81? We already do that as part of the 15-minute periodic cleanup, so dropping it here would shave some more processing time off this hot `GetRemoteNodeConnection` function.
Yeah, we could do that. However, the majority of the time is likely spent on the Enumerate and connection-create calls. `cleanupMissingNodeConnections` right here pretty much guarantees that a remote connection will go to a valid node, and it uses the output of `nodes.Enumerate`, which is already called earlier.
I could see a case where one removed node would cause the round robin to halt on a dead node. In CSI, for example, we default to the local node if a remote node fails:
https://github.com/libopenstorage/openstorage/blob/master/csi/controller.go#L552-L553
We could skip over dead nodes by incrementing the counter and retrying, but I think that would be a slightly more involved change.
Never mind: we increment the round-robin target regardless of whether the connection succeeds. So if node2, for example, is dead but node0, node1, and node3 are healthy, we will just use the local connection for node2 until it is removed, which takes a maximum of 15 minutes. So yeah, we can make this change.
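For the record, the skip-over-dead-nodes retry we decided against could look roughly like this, reusing the types from the sketch above; `dial` is a stand-in for the real gRPC connect, not a function from this PR:

```go
// Hypothetical sketch of the skip-dead-nodes retry idea; not what the PR
// implements, since we settled on falling back to the local connection.
func (rr *roundRobin) remoteConnWithRetry(nodes []node, dial func(endpoint string) error) (string, bool) {
	for range nodes { // at most one attempt per node per call
		endpoint, isRemote := rr.getTargetAndIncrement(nodes)
		if !isRemote {
			continue // skip the local node; we want a remote connection
		}
		if dial(endpoint) == nil {
			return endpoint, true // found a healthy remote node
		}
		// Dead node: the index has already advanced, so just try the next one.
	}
	// Every remote dial failed (or there are no remote nodes): fall back to
	// the local connection, as the CSI controller does on remote failure.
	return "", false
}
```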
This could be optimized further by removing `cluster.Enumerate` from this hot path altogether in favor of a node watcher + cache in a separate thread, but we can save that for another time.
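One possible shape of that watcher + cache idea, again reusing the `node` type from the first sketch (with `context`, `sync`, and `time` imported); none of these names come from the PR:

```go
// Illustrative only: a background goroutine refreshes the node list so the
// hot path never has to call Enumerate itself.
type nodeCache struct {
	mu    sync.RWMutex
	nodes []node
}

// run re-enumerates on a fixed period until the context is cancelled.
func (c *nodeCache) run(ctx context.Context, enumerate func() []node, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			fresh := enumerate() // the only place Enumerate is invoked
			c.mu.Lock()
			c.nodes = fresh
			c.mu.Unlock()
		}
	}
}

// snapshot is what GetRemoteNodeConnection would read instead of enumerating.
func (c *nodeCache) snapshot() []node {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.nodes
}
```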
Force-pushed from 8564bf1 to ae775b5
* PWX-33631: Hold round-robin lock only when needed (#2335)
* PWX-33631: Remove global lock on getRemoteConn (#2339)

Signed-off-by: Grant Griffiths <[email protected]>
What this PR does / why we need it:
Hold the round-robin lock only when needed. Previously, we were holding the lock during Enumerate calls as well as while connecting to the node itself. We only need to hold the lock while keeping track of the connections, not while making external API calls.
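A rough sketch of the narrowed locking pattern described above, reusing the `roundRobin` and `node` types from the first sketch; `enumerate`, the `conns` map, and the dial options are placeholders rather than the PR's actual code (imports assumed: `context`, `google.golang.org/grpc`, `google.golang.org/grpc/credentials/insecure`):

```go
// Sketch only: shows which work runs outside vs. inside the lock.
func (rr *roundRobin) getRemoteNodeConnectionSketch(
	ctx context.Context,
	enumerate func() []node,
	conns map[string]*grpc.ClientConn,
) (*grpc.ClientConn, bool, error) {
	// External API calls run without the lock.
	nodes := enumerate() // previously executed while holding the lock
	endpoint, isRemote := rr.getTargetAndIncrement(nodes)

	conn, err := grpc.DialContext(ctx, endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, false, err
	}

	// Only the connection bookkeeping needs the lock.
	rr.mu.Lock()
	conns[endpoint] = conn
	rr.mu.Unlock()
	return conn, isRemote, nil
}
```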
Which issue(s) this PR fixes (optional)
Closes #
Special notes for your reviewer: