PWX-33631: Hold round-robin lock only when needed #2335

Merged 1 commit into libopenstorage:master on Sep 13, 2023

Conversation

ggriffiths (Contributor):

What this PR does / why we need it:
Hold round-robin lock only when needed. Before, we were holding the lock during enumerate calls as well as while connecting to the node itself.

We only need to hold the lock when keeping track of the connections, not when making external API calls.
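
For illustration, here is a minimal Go sketch of the locking pattern this change moves to. The struct fields, helper names, and dial options below are assumptions made for the example, not the actual openstorage code:

```go
package roundrobin

import (
	"context"
	"errors"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// roundRobin is an illustrative stand-in for the type in this
// repository; the fields here are assumptions, not the real code.
type roundRobin struct {
	mu       sync.Mutex
	lastNode int
	conns    map[string]*grpc.ClientConn
}

func newRoundRobin() *roundRobin {
	return &roundRobin{conns: make(map[string]*grpc.ClientConn)}
}

// GetRemoteNodeConnection sketches the pattern: take the lock only to
// read/update shared state, and release it before any external call
// such as a dial or an enumerate.
func (rr *roundRobin) GetRemoteNodeConnection(ctx context.Context, endpoints []string) (*grpc.ClientConn, error) {
	rr.mu.Lock()
	if len(endpoints) == 0 {
		rr.mu.Unlock()
		return nil, errors.New("no remote endpoints")
	}
	// Critical section: pick the next target and advance the counter
	// (lastNode + 1 mod numOfNodes, per the code comment in the diff).
	target := endpoints[rr.lastNode%len(endpoints)]
	rr.lastNode = (rr.lastNode + 1) % len(endpoints)
	if conn, ok := rr.conns[target]; ok {
		rr.mu.Unlock()
		return conn, nil
	}
	rr.mu.Unlock()

	// External call made outside the lock, so other callers are not
	// serialized behind network latency.
	conn, err := grpc.DialContext(ctx, target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}

	// Re-acquire the lock only to record the connection; if another
	// goroutine dialed the same target first, keep its connection.
	rr.mu.Lock()
	defer rr.mu.Unlock()
	if existing, ok := rr.conns[target]; ok {
		_ = conn.Close()
		return existing, nil
	}
	rr.conns[target] = conn
	return conn, nil
}
```

The key point is that the mutex only ever guards counter and map access, so a slow dial to one node no longer blocks every other caller.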

Which issue(s) this PR fixes (optional)
Closes #

Special notes for your reviewer:

ggriffiths (Contributor, Author) commented on Sep 12, 2023:

Ran my longevity script with this change for about 15 minutes, re-creating and deleting 30 PVCs over and over.

```
$ k get pvc
NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
px-nginx-pvc-sharedv4-1    Bound    pvc-341634f6-4dc7-43c2-a755-5fc06918b556   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-10   Bound    pvc-b02cc040-221a-4e39-b554-c11c7f0b1b6f   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-11   Bound    pvc-a43d7316-9382-49ed-9e23-b7dd3dd0290d   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-12   Bound    pvc-1655f627-d47f-4269-9211-550afbc5d407   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-13   Bound    pvc-90cb436f-5d3a-40a4-9f9d-6f90b6c2b5dd   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-14   Bound    pvc-6edbe043-0181-4c23-9519-e625b8d95ed3   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-15   Bound    pvc-418f39ad-8284-4c5a-9011-361da23cf6f1   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-16   Bound    pvc-d3eb4e95-832d-455c-a6f7-d5ce47998c61   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-17   Bound    pvc-2fa2ad97-7a7e-475e-abb4-7606ff1e7310   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-18   Bound    pvc-f10d28ae-91ed-4800-9c93-2e4c4fc36573   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-19   Bound    pvc-1291a268-6939-453e-85ed-ea864bb87919   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-2    Bound    pvc-59650ce2-b020-4c00-bedd-07f0c5ba2e3c   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-20   Bound    pvc-9789cbf7-4b57-4143-bc34-6f2a53a59525   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-21   Bound    pvc-53921c3c-08a6-4148-b52c-0322d63e29a3   100Gi      RWX            px-nginx-sc-v4   22s
px-nginx-pvc-sharedv4-22   Bound    pvc-a9fa9b0d-0acc-4bd3-90a6-c8fb4f7c39cb   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-23   Bound    pvc-59a64445-a6ec-43b0-9e33-61143fa001d1   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-24   Bound    pvc-85ac03a7-e117-46b1-bea2-05017d7ea700   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-25   Bound    pvc-2cc19095-c1a4-4c4e-93ee-34a88f6c0fc4   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-26   Bound    pvc-52fcc3ff-feed-4709-997b-c54f702451e5   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-27   Bound    pvc-5856c656-db09-4cdf-8ad9-8f98d12af815   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-28   Bound    pvc-57d68c53-a2d0-44be-baf7-4e8ce24be706   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-29   Bound    pvc-d43ef06f-87e1-4d48-8540-fa2e6410fa65   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-3    Bound    pvc-f20d6485-74fd-4dc8-956d-9e72ae2b3543   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-30   Bound    pvc-7c87eb2a-2305-4e87-a78f-0ddfc03c6b5f   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-4    Bound    pvc-851d2bc5-4f92-4c53-be6e-f01dfd9d3f37   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-5    Bound    pvc-97ca5e7b-d4d6-4b3c-89b2-a71aa139714e   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-6    Bound    pvc-a688314c-7015-4dd6-8b6c-c6750e7ba322   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-7    Bound    pvc-6aa2bf57-fffd-41ed-bfb8-21b9940d8b21   100Gi      RWX            px-nginx-sc-v4   21s
px-nginx-pvc-sharedv4-8    Bound    pvc-388666c3-9cbe-4bc6-a93b-c8652de1d8b3   100Gi      RWX            px-nginx-sc-v4   20s
px-nginx-pvc-sharedv4-9    Bound    pvc-e37f2d3a-6f3b-4ae3-a472-258b3ba413de   100Gi      RWX            px-nginx-sc-v4   20s
```

ggriffiths requested a review from a team on September 12, 2023 at 01:52.
@@ -85,6 +82,39 @@ func (rr *roundRobin) GetRemoteNodeConnection(ctx context.Context) (*grpc.Client

```go
// Get target node info and set next round robbin node.
// nextNode is always lastNode + 1 mod (numOfNodes), to loop back to zero
targetNodeEndpoint, isRemoteConn := rr.getTargetAndIncrement(&cluster)
```
A reviewer (Contributor) commented:

Do you think we can remove the rr.cleanupMissingNodeConnections() call from line 81? We already do that as part of the 15-minute periodic cleanup, so dropping it here would shave some more processing time off this hot GetRemoteNodeConnection function.
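
(For context, the periodic cleanup being referred to would look roughly like the sketch below, extending the hypothetical roundRobin type from the PR description above and adding "time" to its imports. Only the 15-minute interval and the cleanupMissingNodeConnections name come from this conversation; the rest is assumed.)

```go
// startPeriodicCleanup runs cleanup in the background until stop is
// closed. listLive stands in for however the real code derives the
// set of live nodes, e.g. from a cluster Enumerate call.
func (rr *roundRobin) startPeriodicCleanup(listLive func() map[string]bool, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(15 * time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				rr.cleanupMissingNodeConnections(listLive())
			case <-stop:
				return
			}
		}
	}()
}

// cleanupMissingNodeConnections closes and drops connections to nodes
// that are no longer part of the cluster.
func (rr *roundRobin) cleanupMissingNodeConnections(liveNodes map[string]bool) {
	rr.mu.Lock()
	defer rr.mu.Unlock()
	for target, conn := range rr.conns {
		if !liveNodes[target] {
			_ = conn.Close()
			delete(rr.conns, target)
		}
	}
}
```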

ggriffiths (Contributor, Author) commented on Sep 12, 2023:

Yeah, we could do that. However, the majority of the time is likely spent on the Enumerate call and connection creation. Calling cleanupMissingNodeConnections right here pretty much guarantees that a remote connection will be made to a valid node, and it uses the output from nodes.Enumerate, which is already called earlier.

I could see a case where one removed node would cause round robin to halt on a dead node. In CSI, for example, we default to the local node if a remote node fails:
https://github.com/libopenstorage/openstorage/blob/master/csi/controller.go#L552-L553

We could skip over dead nodes by incrementing the counter and retrying, but I think that would be a slightly more involved change.

Never mind - we increment the round-robin target regardless of whether the connection succeeds. So if node2, for example, is dead but node0, node1, and node3 are healthy, we will just use the local connection on node2's turn until it is removed, which takes a maximum of 15 minutes. So yeah, we can make this change.
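
To make that fallback concrete, here is a hedged sketch building on the roundRobin example above; the remoteOrLocal name and the local-connection parameter are invented for illustration:

```go
// remoteOrLocal sketches the behavior described above: the round-robin
// counter advances on every call, so a dead node costs only its own
// turn, which is served by the local connection instead.
func (rr *roundRobin) remoteOrLocal(ctx context.Context, endpoints []string, local *grpc.ClientConn) *grpc.ClientConn {
	conn, err := rr.GetRemoteNodeConnection(ctx, endpoints)
	if err != nil {
		// The target node is dead or unreachable; fall back to the
		// local connection for this turn. The periodic cleanup will
		// eventually remove the node, so this lasts at most ~15 min.
		return local
	}
	return conn
}
```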

ggriffiths (Contributor, Author) commented:

This could be optimized further by removing cluster.Enumerate in this hot path altogether in favor of a node watcher+cache in a separate thread, but we can save that for another time.
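
A rough sketch of that watcher+cache idea, in the same vein as the sketches above (all names are hypothetical, and refresh would wrap the cluster Enumerate call; sync and time imports assumed):

```go
// nodeCache keeps a periodically refreshed copy of the node list so
// the hot path never calls Enumerate directly.
type nodeCache struct {
	mu    sync.RWMutex
	nodes []string
}

// watch refreshes the cache on an interval until stop is closed;
// run it in its own goroutine.
func (c *nodeCache) watch(refresh func() []string, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fresh := refresh()
			c.mu.Lock()
			c.nodes = fresh
			c.mu.Unlock()
		case <-stop:
			return
		}
	}
}

// snapshot returns a copy so callers can iterate without holding the lock.
func (c *nodeCache) snapshot() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.nodes...)
}
```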

ggriffiths merged commit f769b71 into libopenstorage:master on Sep 13, 2023.
github-actions bot pushed a commit that referenced this pull request on Sep 13, 2023.

ggriffiths pushed a commit that referenced this pull request on Sep 13, 2023:
Signed-off-by: Grant Griffiths <[email protected]>
Co-authored-by: Grant Griffiths <[email protected]>

ggriffiths pushed commits to ggriffiths/openstorage and to this repository that referenced this pull request on Sep 19, 2023, with the combined message:

* PWX-33631: Hold round-robin lock only when needed (#2335)
* PWX-33631: Remove global lock on getRemoteConn (#2339)

Signed-off-by: Grant Griffiths <[email protected]>