Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove a unnecessary Thread.sleep() from ClusterCommandExecutor.collectResults() #2518

Closed
dongmyo opened this issue Mar 2, 2023 · 8 comments
Assignees
Labels
in: core Issues in core support type: task A general task

Comments

@dongmyo
Copy link

dongmyo commented Mar 2, 2023

https://github.com/spring-projects/spring-data-redis/blob/main/src/main/java/org/springframework/data/redis/connection/ClusterCommandExecutor.java#L224

when for loop ends with all futures done or cancelled in ClusterCommandExecutor.collectResults().

https://github.com/spring-projects/spring-data-redis/blob/main/src/main/java/org/springframework/data/redis/connection/ClusterCommandExecutor.java#L258

after that, there is no need for thread to be slept, but Thread.sleep() is called unnecessarily.

it could be better to check if the done is false.

try {
    if (!done) {
        Thread.sleep(10);
    }
} catch (InterruptedException e) {
    // ...   
}
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Mar 2, 2023
@jxblum jxblum self-assigned this Mar 4, 2023
@jxblum
Copy link
Contributor

jxblum commented Mar 8, 2023

Hi @dongmyo -

I appreciate you raising this issue; nice catch!

I also think you are generally correct, here. However, I have been giving this some serious thought, off-and-on, for the past few days, while working through other SD Redis Issues in our backlog.

In summary, I am thinking this Thread.sleep() call should be removed altogether as I am not entirely convinced that 1) it is strictly necessary, and 2) could be achieved in another, more productive way (such as using Future.get(timeout, :TimeUnit), here). I need to carefully think this through.

I think the code will need to be restructured, which should help simplify the code a bit since it is rather complex to reason about anyway, IMO. It also does not appropriately handle the CancellationException that can be thrown from Future.get(), which is not documented (nor here) either, where the collectResults(..) method is invoked (i.e. from methods that are part of the public API).

In general, arbitrarily calling Thread.sleep(..) is dangerous for at least 2 reasons:

  1. Threads in Java are native to the platform on which the JRE and Java application are built and run (unlike the Green Threads, or Project Loom, to an extent, which is now on the horizon). Therefore, Java Threads are subject to the vagaries of scheduling and behavior consistent with the underlying OS and Thread model in use.

  2. Secondly, arbitrary Thread.sleep(..) timeouts (e.g. 10 milliseconds +/-) might be perfectly acceptable in some applications while detrimental in others. What is too much or too little? And with Redis being all about efficient/fast access to data, this sort of induced latency is questionable at best.

My plan of attack is to write some tests for this area of code first so that I can then begin to make (structural) changes that ultimately get rid of the Thread.sleep(..) altogether.

I will post back when I have more information to share.

@jxblum jxblum added the in: core Issues in core support label Mar 8, 2023
@dongmyo
Copy link
Author

dongmyo commented Mar 9, 2023

@jxblum thank you for your response.
either way, it will be fine as long as unnecessary Thread.sleep() is going to be removed.

@jxblum jxblum added status: on-hold We cannot start working on this issue yet and removed status: waiting-for-triage An issue we've not yet triaged labels Aug 8, 2023
@jxblum
Copy link
Contributor

jxblum commented Sep 27, 2023

I am finally, now circling back to this.

Knowing significantly more about Spring Data Redis and just Redis in general now, my thinking is this "arbitrary" Thread.sleep() call (here) can be simply removed altogether. It really serves no purpose.

The Future.get() calls (here and here) "waits until the computation completes" anyway.

So, I think it is also unnecessary to use Future.get(timeout, :TimeUnit) in this case given blocking/waiting nature of Future.get() in the first place. Although, it may not be the most effective way of "collecting results" across the nodes in the Redis cluster.

Additionally, after looking through this code, I also think there is a bug!

If execution fails for any node when evaluating the result, then there is no point to re-evaluate the node on a subsequent iteration (in the while loop). That is, I think the futureId should be added to the saveGuard (as so) inside the ExecutionException handler block.

Finally, it is not necessary to "handle" the CancellationException as I previously stated.

@rishiraj88
Copy link

Very impressive issue. Thanks.

jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Keep Thread.sleep(..) to avoid the busy (hot) loop when the Future executions are not yet complete.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
@jxblum
Copy link
Contributor

jxblum commented Sep 27, 2023

Opened PR #2719, which is pending review and merge. Buildable changes captured in issue branch issue/2518.

jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Keep Thread.sleep(..) to avoid the busy (hot) loop when the Future executions are not yet complete.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Reduce Thread.sleep(1L) to 1 millisecond.

Closes spring-projects#2518
@jxblum
Copy link
Contributor

jxblum commented Sep 27, 2023

Technically, after further review and careful consideration, I should clarify that the Thread.sleep(..) is possibly useful when none of the Future results are complete yet.

Depending on the Future implementation, without a pause, the while loop (here) inside ClusterCommandExecutor.collectResults(..) could run hot and wind up being a "busy" loop.

The new implementation is here, along with this.

I also reduced the Thread.sleep(..) to 1 millisecond.

NOTE: I could have chosen to use a TimeUnit.timedWait(..) (Javadoc), thereby reducing the latency even further to MICROSECONDS (e.g. TimeUnit.MICROSECONDS.timedWait(obj, 10)) or even NANOSECONDS, but that requires a synchronized block on an Object monitor (or an object's intrinsic lock). This would limit scalability in the context of Virtual Threads.

@jxblum
Copy link
Contributor

jxblum commented Sep 27, 2023

On second thought, recalling earlier thoughts I had, I do think there is a way to remove the Thread.sleep(..) call completely.

I switched PR #2719 to back DRAFT mode for the time being.

jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) call for 1 millisecond. As such, the Future.isDone() (and Future.isCancelled()) calls are no longer necessary. Simply try to get the results, and if a TimeoutException is thrown, then simply set done to false.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false. 10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the compuptation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/requst routing etc. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time while waiting for other nodes to complete.

10 microseconds was partially arbitrary, but not anymore so than Thread.sleep(10L) (10 millisconds). The objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is "done", hence Future.get(timeout, TimeUnit.MICROSECONDS), or sub-millisecond response times. This may need to be futher tuned over time, but should serve as a reasonable baseline for the time being. This was also based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, "recommending" a cluster size of no more than 1000 nodes.

One optimimzation could be to reorder the Map of Futures at the end of each iteration to organize Futures that are done. Furthermore, Futures that have already completed could be removed from the Map. Of course, there is little harm in keep the future in the Map witht the "safeguard" in place. This was not included in the changes simply because the optimization is most likely neglible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false. 10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the compuptation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/requst routing etc. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time while waiting for other nodes to complete.

10 microseconds was partially arbitrary, but not anymore so than Thread.sleep(10L) (10 millisconds). The objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is "done", hence Future.get(timeout, TimeUnit.MICROSECONDS), or sub-millisecond response times. This may need to be futher tuned over time, but should serve as a reasonable baseline for the time being. This was also based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, "recommending" a cluster size of no more than 1000 nodes.

One optimimzation might be to reorder the Map of Futures at the end of each iteration organizing Futures that are done first. Furthermore, Futures that have already completed could be removed from the Map. Of course, there is little harm in keeping the future in the Map with the "safeguard" in place. This optimization was not included in the changes simply because the optimization is most likely neglible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
@jxblum
Copy link
Contributor

jxblum commented Sep 27, 2023

Alright, PR #2719 re-opened (no longer in draft mode).

I also modified the PR to now remove Thread.sleep(..) completely! You can read the commit message for the reasoning behind the current thinking and set of changes.

jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 27, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false. 10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the compuptation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/requst routing etc. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time while waiting for other nodes to complete.

10 microseconds was partially arbitrary, but not anymore so than Thread.sleep(10L) (10 millisconds). The objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is "done", hence Future.get(timeout, TimeUnit.MICROSECONDS), or sub-millisecond response times. This may need to be futher tuned over time, but should serve as a reasonable baseline for the time being. This was also based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, "recommending" a cluster size of no more than 1000 nodes.

One optimimzation might be to reorder the Map of Futures at the end of each iteration organizing Futures that are done first. Furthermore, Futures that have already completed could be removed from the Map. Of course, there is little harm in keeping the future in the Map with the "safeguard" in place. This optimization was not included in the changes simply because the optimization is most likely neglible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 29, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false.

10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the computation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/request routing, and so on. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time when waiting on a single node to complete.

10 microseconds was partially arbitrary, but no more so than Thread.sleep(10L) (10 milliseconds). The main objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is done, hence Future.get(timeout, TimeUnit.MICROSECONDS) for sub-millisecond response times. This may need to be further tuned over time, but should serve as a reasonable baseline for the time being. Additionally, this was based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, recommending a cluster size of no more than 1000 nodes.

One optimization might be to reorder the Map of Futures at the end of each iteration by organizing Futures that are done first. Furthermore, Futures that have already completed could even be removed from the Map. Of course, there is little harm in keeping the completed Futures in the Map with the safeguard in place. This optimization was not included in theses changes simply because the optimization is most likely negligible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 29, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false.

10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the computation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/request routing, and so on. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time when waiting on a single node to complete.

10 microseconds was partially arbitrary, but no more so than Thread.sleep(10L) (10 milliseconds). The main objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is done, hence Future.get(timeout, TimeUnit.MICROSECONDS) for sub-millisecond response times. This may need to be further tuned over time, but should serve as a reasonable baseline for the time being. Additionally, this was based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, recommending a cluster size of no more than 1000 nodes.

One optimization might be to reorder the Map of Futures at the end of each iteration by organizing Futures that are done first. Furthermore, Futures that have already completed could even be removed from the Map. Of course, there is little harm in keeping the completed Futures in the Map with the safeguard in place. This optimization was not included in theses changes simply because the optimization is most likely negligible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 29, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false.

10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the computation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/request routing, and so on. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time when waiting on a single node to complete.

10 microseconds was partially arbitrary, but no more so than Thread.sleep(10L) (10 milliseconds). The main objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is done, hence Future.get(timeout, TimeUnit.MICROSECONDS) for sub-millisecond response times. This may need to be further tuned over time, but should serve as a reasonable baseline for the time being. Additionally, this was based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, recommending a cluster size of no more than 1000 nodes.

One optimization might be to reorder the Map of Futures at the end of each iteration by organizing Futures that are done first. Furthermore, Futures that have already completed could even be removed from the Map. Of course, there is little harm in keeping the completed Futures in the Map with the safeguard in place. This optimization was not included in theses changes simply because the optimization is most likely negligible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Sep 29, 2023
Replace Thread.sleep(..) with Future.get(timeout, :TimeUnit) for 10 microseconds. As a result, Future.isDone() and Future.isCancelled() are no longer necessary. Simply try to get the results within 10 us, and if a TimeoutException is thrown, then set done to false.

10 microseconds is 1/1000 of 10 milliseconds. This means a Redis cluster with 1000 nodes will run in a similar time to Thread.sleep(10L) if all Futures are blocked waiting for the computation to complete and take an equal amount of time to compute the result, which is rarely the case in practice, given different hardware configurations, data access patterns, load balancing/request routing, and so on. However, using Future.get(timeout, :TimeUnit) is more fair than Future.get(), which blocks until a result is returned or an ExecutionException is thrown, thereby starving computationally faster nodes vs. other nodes in the cluster that might be overloaded. In the meantime, some nodes may even complete in the short amount of time when waiting on a single node to complete.

10 microseconds was partially arbitrary, but no more so than Thread.sleep(10L) (10 milliseconds). The main objective was to give each node a chance to complete the computation in a moments notice balanced with the need to quickly check if the computation is done, hence Future.get(timeout, TimeUnit.MICROSECONDS) for sub-millisecond response times. This may need to be further tuned over time, but should serve as a reasonable baseline for the time being. Additionally, this was based on https://redis.io/docs/reference/cluster-spec/#overview-of-redis-cluster-main-components in the Redis documentation, recommending a cluster size of no more than 1000 nodes.

One optimization might be to reorder the Map of Futures at the end of each iteration by organizing Futures that are done first. Furthermore, Futures that have already completed could even be removed from the Map. Of course, there is little harm in keeping the completed Futures in the Map with the safeguard in place. This optimization was not included in theses changes simply because the optimization is most likely negligible and should be measured. Reconstructing a TreeMap should run mostly within log(n) time, but memory consumption should also be taken into consideration.

Add test coverage for ClusterCommandExecutor collectResults(..) method.

Cleanup compiler warnings in ClusterCommandExecutorUnitTests.

Closes spring-projects#2518
mp911de added a commit that referenced this issue Oct 11, 2023
Simplify tests. Reuse existing interfaces from Spring. Remove inappropriate nullability annotations and introduce annotations where required.

Replace Future mocking with easier to maintain and to read future method overrides. Remove superfluous code and replace with infrastructure classes provided by Spring Framework.

Consistently name callbacks. Make exception collector concept explicit. Reformat code.

See #2518
Original pull request: #2719
@mp911de mp911de added type: task A general task and removed status: on-hold We cannot start working on this issue yet labels Oct 11, 2023
@mp911de mp911de modified the milestone: 3.1.5 (2023.0.5) Oct 11, 2023
@mp911de mp911de added this to the 3.2 RC1 (2023.1.0) milestone Oct 11, 2023
jxblum added a commit to jxblum/spring-data-redis that referenced this issue Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: core Issues in core support type: task A general task
Projects
None yet
5 participants