PWX-33631: Hold round-robin lock only when needed #2335
Conversation
Force-pushed from 25376fc to 8564bf1
Ran my longevity script w/ this change for about 15 minutes, re-creating and deleting 30 PVCs over and over.
@@ -85,6 +82,39 @@ func (rr *roundRobin) GetRemoteNodeConnection(ctx context.Context) (*grpc.Client
// Get target node info and set next round-robin node.
// nextNode is always (lastNode + 1) mod numOfNodes, to loop back to zero.
targetNodeEndpoint, isRemoteConn := rr.getTargetAndIncrement(&cluster)
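For context, a minimal sketch of what `getTargetAndIncrement` might look like with the lock narrowed this way. All type and field names below (`node`, `mu`, `nextNode`, `localNodeID`) are assumptions for illustration, not the PR's actual code:

```go
package main

import "sync"

// Sketch only: these types stand in for the real roundRobin and cluster
// structures; the field names are assumptions, not the PR's code.
type node struct {
	ID       string
	Endpoint string
}

type roundRobin struct {
	mu          sync.Mutex
	nextNode    int
	localNodeID string
}

// getTargetAndIncrement holds the lock only for the read-and-advance of the
// round-robin index; enumeration and dialing happen outside it.
func (rr *roundRobin) getTargetAndIncrement(nodes []node) (endpoint string, isRemote bool) {
	if len(nodes) == 0 {
		return "", false
	}

	rr.mu.Lock()
	if rr.nextNode >= len(nodes) { // the cluster may have shrunk since last call
		rr.nextNode = 0
	}
	idx := rr.nextNode
	// nextNode is always (lastNode + 1) mod numOfNodes, looping back to zero.
	rr.nextNode = (idx + 1) % len(nodes)
	rr.mu.Unlock()

	target := nodes[idx]
	return target.Endpoint, target.ID != rr.localNodeID
}
```

The point is that only the index read-and-advance sits inside the critical section; everything expensive happens with the lock released.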
Do you think we can remove the `rr.cleanupMissingNodeConnections()` call from line 81? We already do that as part of the 15-minute periodic cleanup, so dropping it here would shave some more processing time off this hot `GetRemoteNodeConnection` function.
Yeah, we could do that. However, the majority of the time is likely spent on the Enumerate and connection-create calls. `cleanupMissingNodeConnections` right here pretty much guarantees that a remote connection will go to a valid node, and it uses the output of `nodes.Enumerate`, which is already called earlier.
I could see a case where one removed node would cause the round robin to halt on a dead node. In CSI, for example, we default to the local node if a remote node fails:
https://github.com/libopenstorage/openstorage/blob/master/csi/controller.go#L552-L553
We could skip over dead nodes by incrementing the counter and retrying, but I think that would be a slightly more involved change.
Never mind: we increment the round-robin target regardless of whether the connection succeeds. So if node2, for example, is dead but node0, node1, and node3 are healthy, we will just use the local connection for node2 until it is removed, which takes a maximum of 15 minutes. So yeah, we can make this change.
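For the record, the skip-over-dead-nodes retry we decided against could look roughly like this, reusing the types from the sketch above; `dial` is a stand-in for the real gRPC connect, not a function from this PR:

```go
// Hypothetical sketch of the skip-dead-nodes retry idea; not what the PR
// implements, since we settled on falling back to the local connection.
func (rr *roundRobin) remoteConnWithRetry(nodes []node, dial func(endpoint string) error) (string, bool) {
	for range nodes { // at most one attempt per node per call
		endpoint, isRemote := rr.getTargetAndIncrement(nodes)
		if !isRemote {
			continue // skip the local node; we want a remote connection
		}
		if dial(endpoint) == nil {
			return endpoint, true // found a healthy remote node
		}
		// Dead node: the index has already advanced, so just try the next one.
	}
	// Every remote dial failed (or there are no remote nodes): fall back to
	// the local connection, as the CSI controller does on remote failure.
	return "", false
}
```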
This could be optimized further by removing `cluster.Enumerate` from this hot path altogether in favor of a node watcher + cache in a separate thread, but we can save that for another time.
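One possible shape of that watcher + cache idea, again reusing the `node` type from the first sketch (with `context`, `sync`, and `time` imported); none of these names come from the PR:

```go
// Illustrative only: a background goroutine refreshes the node list so the
// hot path never has to call Enumerate itself.
type nodeCache struct {
	mu    sync.RWMutex
	nodes []node
}

// run re-enumerates on a fixed period until the context is cancelled.
func (c *nodeCache) run(ctx context.Context, enumerate func() []node, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			fresh := enumerate() // the only place Enumerate is invoked
			c.mu.Lock()
			c.nodes = fresh
			c.mu.Unlock()
		}
	}
}

// snapshot is what GetRemoteNodeConnection would read instead of enumerating.
func (c *nodeCache) snapshot() []node {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.nodes
}
```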
Force-pushed from 8564bf1 to ae775b5
* PWX-33631: Hold round-robin lock only when needed (#2335)
* PWX-33631: Remove global lock on getRemoteConn (#2339)

Signed-off-by: Grant Griffiths <[email protected]>
What this PR does / why we need it:
Hold the round-robin lock only when needed. Previously, we were holding the lock during Enumerate calls as well as while connecting to the node itself. We only need to hold the lock while keeping track of the connections, not while making external API calls.
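A rough sketch of the narrowed locking pattern described above, reusing the `roundRobin` and `node` types from the first sketch; `enumerate`, the `conns` map, and the dial options are placeholders rather than the PR's actual code (imports assumed: `context`, `google.golang.org/grpc`, `google.golang.org/grpc/credentials/insecure`):

```go
// Sketch only: shows which work runs outside vs. inside the lock.
func (rr *roundRobin) getRemoteNodeConnectionSketch(
	ctx context.Context,
	enumerate func() []node,
	conns map[string]*grpc.ClientConn,
) (*grpc.ClientConn, bool, error) {
	// External API calls run without the lock.
	nodes := enumerate() // previously executed while holding the lock
	endpoint, isRemote := rr.getTargetAndIncrement(nodes)

	conn, err := grpc.DialContext(ctx, endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, false, err
	}

	// Only the connection bookkeeping needs the lock.
	rr.mu.Lock()
	conns[endpoint] = conn
	rr.mu.Unlock()
	return conn, isRemote, nil
}
```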
Which issue(s) this PR fixes (optional)
Closes #
Special notes for your reviewer: