Add fallback from shard awareness in case of overloaded coordinator shards #136
@DoronArazii Please assign somebody to work on this.
Thanks for raising this, @vladzcloudius. @mykaul, please consult with @avelanarius regarding this enhancement.
Sure. I'll let @avelanarius assess feasibility and priority.
Based on how many people use the shard-aware Java driver (and how many of them I have told not to use it in case of shard overload)... guys... this is a P1 IMHO.
What would be the test scenario? A single overloaded shard (for example, due to compaction)? Or would someone else do the actual benchmarking after we implement it?
@avelanarius Yes, the compaction test would be good.
Simply overload a single shard with some CPU-hogging work (e.g. a tight-loop C application) and run a cassandra-stress read test with RF=3. Without a patch, a corresponding share of requests will see higher latency due to queuing on that shard; foreground reads will pile up as high as you allow: if you set threads=N, then on that shard the queue will grow up to N. With the fix, the driver is expected to start steering reads to other shards after some threshold (AFAIR half of max_requests_per_connection, which is controlled by the maxPending argument of cassandra-stress).
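For reference, a minimal sketch of such a CPU hog (the comment above suggests a tight-loop C application; a Java equivalent is shown here for consistency with this repo). Pinning it to a specific shard's CPU via taskset, and the CPU id used, are assumptions about the test setup:

```java
// CpuHog.java - hypothetical sketch of the CPU-hogging workload.
// Pin it to the CPU of the shard you want to overload, e.g. (assumption):
//   taskset -c 3 java CpuHog
public class CpuHog {
    public static void main(String[] args) {
        long counter = 0;
        while (true) {
            counter++;                 // tight loop: keeps one core at ~100%
            if (counter == Long.MAX_VALUE) {
                counter = 0;           // avoid overflow, keep spinning
            }
        }
    }
}
```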
An update on the issue: @Gor027 will start by reproducing the issue (in a clean way!) in a scenario very similar to the one @vladzcloudius described: starting an X-shard node, overloading one shard, and sending queries that touch data in the overloaded shard. Next, he will test how well it performs if we always send the queries to the correct (overloaded) shard versus always sending them to a wrong (non-overloaded) shard. I think it'll be useful to do a "semi-deep" dive analyzing what actually happens in Scylla.
The attempt to reproduce the issue followed the scenario @avelanarius described above:
The results show that sending queries to a wrong, relatively non-overloaded shard doubles the throughput. A possible explanation, supported by tracing of the query execution, is that when a query is sent to the wrong shard, that shard parses and processes the statement itself instead of immediately redirecting the query to the correct shard holding the data. So part of the query execution is done by a non-overloaded shard, which is not the case when shard awareness is enabled.
Great. Thanks for the update, @Gor027.
Please see the opening message of this GH issue. It references the already merged solution to this issue in the Scylla GoCQL driver. Note that @avelanarius asked you to run this testing in the context of a request to implement the same in the Java driver (or any other Scylla driver).
Yes, it is already merged, but it was merged without any proper testing (not merged by me), so it would still require some tuning. As this tuning could be tricky and time-consuming, we decided to first try @fee-mendes' safer idea: in the driver's load-balancing policy, pick the node+shard that has the least number of in-flight requests. However, we are not saying this is the only solution we will implement and try out.
I don't understand your point, @avelanarius |
The difference is as follows. Suppose we have a table in a keyspace with RF=3, and an INSERT query is executed. Three nodes can become coordinators: A (shard a), B (shard b), C (shard c). The GoCQL optimization (as I understand it) works like this: if we decide to send a query to A (shard a), but this shard (the connection to it) is overloaded, we instead send the query to A (shard leastBusyOnA). We wouldn't change the node at that moment; we would send the query to a wrong shard, resulting in cross-shard ops (not saying this is bad). The optimization I described would kick in earlier, at the moment of choosing which replica to send the query to. We would sort the replica+shard pairs by the number of in-flight requests (keeping a count per shard on each node). If A (shard a) were overloaded, we would instead choose, for example, B (shard b), whichever had the least number of in-flight requests. Shard b on node B would be the "correct" shard, not resulting in cross-shard ops.
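For illustration, a rough sketch of the replica-selection heuristic described above; every name here (ReplicaChooser, Replica, Token, shardFor, inflightOnShard) is a hypothetical stand-in, not the actual driver API:

```java
import java.util.Comparator;
import java.util.List;

final class ReplicaChooser {
    // Illustrative stand-ins, not the actual driver API.
    interface Token {}
    interface Replica {
        int shardFor(Token token);        // owner shard for this token on this node
        int inflightOnShard(int shardId); // per-shard in-flight request counter
    }

    // Choose the replica whose owner shard for the query's token currently
    // has the fewest in-flight requests, so the request still lands on the
    // correct shard (no cross-shard ops).
    static Replica pickLeastBusyOwner(List<Replica> replicas, Token token) {
        return replicas.stream()
                .min(Comparator.comparingInt(
                        (Replica r) -> r.inflightOnShard(r.shardFor(token))))
                .orElseThrow(() -> new IllegalStateException("no replicas"));
    }
}
```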
I see. I assumed that the optimization you are describing is already the way the coordinator is chosen. If that's not the case, this is an orthogonal issue that needs to be resolved regardless. The optimization in question (the topic of this GH issue) deals with the situation that may happen after the logic you described has been applied. Let's not mix these things together; these two heuristics are meant to complement each other. However, there are use cases where your heuristic alone won't help: in your example, when the corresponding shard on all 3 replicas is overloaded, e.g. due to compactions running on all of them caused by a heavy write that put a lot of data into a single big partition.
I now understand the confusion: Java Driver 4.x DOES take the number of in-flight requests into account in the load-balancing policy. However, the existing implementation only compares whole-node in-flight numbers, not per-shard numbers, so I guess it isn't that sensitive to a single-shard overload. Java Driver 3.x (used, for example, in cassandra-stress) has a naive round-robin policy that doesn't take the in-flight request numbers into account.
A new RandomTwoChoice policy is waiting to be merged in PR #198. The policy adds a slight optimization to the way the coordinator node is chosen when replicas are overloaded: it takes into account the current load of each replica and shard before sending a request. The benchmarks and their results are described in detail in the PR description; you can see how well it performs when the cluster is partially overloaded. This, however, does not fully solve the problem described in this issue. When the whole cluster (all nodes/replicas) is overloaded due to heavy writes causing a lot of background processing (compaction, hints, etc.), the throughput remains the same according to our benchmarks, but latencies may skyrocket and timeouts may occur. So the issue perhaps merits a separate fix and separate benchmarks.
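For illustration, a hedged sketch of the "power of two random choices" idea that a policy like RandomTwoChoice is built on; this is not the code from PR #198, and all names here are hypothetical:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class TwoChoicePicker {
    // Illustrative stand-ins, not the actual driver API.
    interface Token {}
    interface Replica {
        int shardFor(Token token);        // owner shard for this token
        int inflightOnShard(int shardId); // per-shard in-flight counter
    }

    // Sample two candidates at random; send the request to the one whose
    // target shard has fewer in-flight requests ("power of two choices").
    static Replica pick(List<Replica> candidates, Token token) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        Replica a = candidates.get(rnd.nextInt(candidates.size()));
        Replica b = candidates.get(rnd.nextInt(candidates.size()));
        int loadA = a.inflightOnShard(a.shardFor(token));
        int loadB = b.inflightOnShard(b.shardFor(token));
        return loadA <= loadB ? a : b;
    }
}
```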
Are you sure, @avelanarius? I remember that the 3.x Java drivers used a "shortest queue first" algorithm to pick a TCP connection to the same host when there was more than one such connection - which was the case by default, and we usually made sure of that explicitly as well.
I think two things are getting mixed up. The load-balancing policies in Java drivers (TokenAware, RoundRobin, etc.) only decide which nodes should handle the request. In Java driver 3.x, those decisions are independent of the host/shard in-flight request metrics. However, once a particular node is chosen to handle the request, the session manager picks from that node's/shard's connection pool the TCP connection with the lowest number of in-flight requests. In Java driver 4.x, on the other hand, the default load-balancing policy takes the total number of in-flight requests of a node into account when deciding which node should handle the request; the in-flight request metric serves as a health check, and nodes with fewer in-flight requests are preferred. I think @avelanarius wanted to highlight that, with a slight modification, the shard-aware driver would be able to compare the in-flight requests of the target shards of the nodes for a particular request, instead of comparing the sum of all in-flight requests per node. That would allow the load-balancing policy to make better decisions and avoid sending requests to a replica whose target shard is overloaded.
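A minimal sketch of the 3.x-style "lowest in-flight connection" pool selection described above, under assumed names rather than the real driver classes:

```java
import java.util.Comparator;
import java.util.List;

final class PoolSelector {
    // Illustrative stand-in for a pooled TCP connection to one node/shard.
    interface PooledConnection {
        int inflight(); // current in-flight request count on this connection
    }

    // Once a node has been chosen by the load-balancing policy, pick the
    // pooled connection with the fewest in-flight requests.
    static PooledConnection leastBusy(List<PooledConnection> pool) {
        return pool.stream()
                .min(Comparator.comparingInt(PooledConnection::inflight))
                .orElseThrow(() -> new IllegalStateException("empty pool"));
    }
}
```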
@Gor027 No need to go in circles - we have already established the two stages of load balancing you are referring to: #136 (comment). This GH issue has been about the load-balancing decision that is made after the target coordinator node has been chosen from the start. And yes, if you want to improve the algorithm for choosing the coordinator node, that should be a matter of a separate GH issue. Back to the context of this GH issue: the idea is to send a CQL request to a "wrong shard" on the chosen coordinator under certain conditions (see the opening message).
This issue is about implementing a mechanism for falling back to picking the least busy connection, even if we know the owner shard, when we detect that a particular coordinator shard is severely overloaded. The idea is already implemented and explained here: scylladb/gocql#86
The whole discussion is worth reading, but specifically, in comment scylladb/gocql#86 (comment), @vladzcloudius requests implementing similar behavior in the Java driver so that he can conveniently test it with cassandra-stress. Hence, this issue :)
/cc @avelanarius does your team have spare cycles to implement such a change, potentially based on the existing PR for gocql linked above?
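To make the requested fallback concrete, here is a hedged sketch modeled loosely on the gocql approach referenced above; the threshold and all names are assumptions, not the actual implementation:

```java
import java.util.Comparator;
import java.util.List;

final class ShardAwareFallback {
    // Illustrative stand-in for a pooled connection, not the real driver class.
    interface Conn {
        int inflight(); // current in-flight request count on this connection
    }

    // Assumption: e.g. half of max_requests_per_connection, per the
    // discussion in this thread; the actual value would need tuning.
    private final int overloadThreshold;

    ShardAwareFallback(int overloadThreshold) {
        this.overloadThreshold = overloadThreshold;
    }

    Conn choose(Conn ownerShardConn, List<Conn> allConnsToNode) {
        if (ownerShardConn.inflight() < overloadThreshold) {
            return ownerShardConn; // normal shard-aware path: correct shard
        }
        // Owner shard looks overloaded: accept a cross-shard op on the same
        // node instead of queuing behind the hot shard.
        return allConnsToNode.stream()
                .min(Comparator.comparingInt(Conn::inflight))
                .orElse(ownerShardConn);
    }
}
```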