Detected performance impact query #15
@avelanarius Connecting the ScyllaDB CDC source connector causes too many queries to ScyllaDB, leading to spikes in CPU load, timeouts, and failures. I suspect this issue is caused by the query pattern.
Hi @pkgonan, please see my answers below.
This is not exactly correct. All the stream ids used in a single query should live on the same set of replicas, so with RF = 3 a single SELECT should affect only 3 nodes in the cluster.
It's not immediately obvious that splitting the queries will improve performance. On one hand, all the partition keys used (stream ids) should live on the same set of nodes, so grouping them should actually help. On the other hand, those partition keys (stream ids) can live on different shards, and it might turn out that splitting the queries and sending them one by one, each to the right shard, is more efficient. If your experiments show that splitting the queries helps, then we can certainly do that.
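For concreteness, here is a minimal sketch of the split approach described above, using the Python driver. This is not the connector's code; the contact point, keyspace/table name, and stream id list are placeholders, while cdc$stream_id and cdc$time are the standard CDC log columns.

```python
# Sketch only: query each CDC stream separately instead of grouping them with IN.
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect()  # placeholder contact point

# Single-partition restriction; a token-aware driver routes each query to a
# replica that owns the stream, and the shard-aware scylla-driver fork can
# additionally pick the owning shard.
per_stream_stmt = session.prepare(
    'SELECT * FROM my_keyspace.my_table_scylla_cdc_log '
    'WHERE "cdc$stream_id" = ? AND "cdc$time" >= ? AND "cdc$time" < ?'
)

def read_window_split(stream_ids, window_start, window_end):
    # window_start / window_end are the timeuuid bounds of the query window.
    futures = [
        session.execute_async(per_stream_stmt, (sid, window_start, window_end))
        for sid in stream_ids
    ]
    return [row for f in futures for row in f.result()]
```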
What was the reason for the failure? Was the cluster just overloaded? Also, could you please share more context on your use case, @pkgonan? That would make it easier for us to understand the problem. Thanks.
@haaawk Hi. When I register the ScyllaDB Source Connector with the Kafka Connect cluster, the read requests increase dramatically. As a result, the load and CPU usage of the ScyllaDB nodes rise rapidly, and systems using ScyllaDB have crashed due to ScyllaDB read and write timeouts (see the read & write latency in the images attached to the issue body). Those images also show that the number of read requests skyrockets.
It is somewhat expected to see an increase in the number of reads. I see that you have 24 nodes, meaning that there will be 24 * 256 * […] CDC streams to query.
@haaawk I remember that when I registered ScyllaDB Source Connectors in the Kafka Connect cluster in the past, the read requests increased several times over for a certain period at the beginning, and then returned to the normal level. Based on this, I suspect there may be a problem in the code path that issues read requests when the ScyllaDB Source Connector is first registered.
That's a very good point. We usually try to select a bit more quickly at the beginning to catch up, but maybe we do this too aggressively. @avelanarius, what do you think?
@haaawk @avelanarius I need to register at least 3 additional ScyllaDB Source Connectors, but due to this problem we can no longer do so: each additional connector rapidly increases the load on ScyllaDB and causes failures in the systems that use it. So I have held off registering the new connectors and have been waiting for an answer in this issue. Thank you for your quick reply.
@pkgonan Thanks for sharing. @avelanarius Could you please look at making the startup of the connector less demanding for Scylla? Why do you need more than one ScyllaDB Source Connector, @pkgonan? Are you putting data on different topics? If so, maybe we could change the connector to allow that with a single instance. That would greatly reduce the traffic to Scylla.
@haaawk Each microservice must therefore run a ScyllaDB Source Connector for the keyspace it uses: 1 microservice == 1 keyspace == 1 ScyllaDB Source Connector. Each connector's tasks.max is 3; we can change it to 1 (tasks.max 3 -> 1).
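For reference, a minimal sketch of how such a connector instance could be registered through the Kafka Connect REST API with tasks.max lowered to 1. The Connect URL, connector name, connector class, and Scylla-specific keys below are illustrative placeholders and should be checked against the connector's documentation.

```python
# Illustrative sketch: register one connector per keyspace with tasks.max = 1.
import requests

config = {
    # Connector class and Scylla-specific keys are placeholders to verify
    # against the scylla-cdc-source-connector documentation.
    "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
    "tasks.max": "1",  # was 3
    "scylla.cluster.ip.addresses": "scylla-node1:9042",
    "scylla.table.names": "my_keyspace.my_table",
}

resp = requests.put(
    "http://kafka-connect:8083/connectors/my-keyspace-cdc/config",
    json=config,
)
resp.raise_for_status()
```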
I see.
As I understand from my own debugging, the initial high load comes from the connector 'catching up' from the beginning of the CDC table log. Is this right? Based on the previous calculations, I think it might result in another multiplier of 24h / [window size].
@haaawk Is there some logic that reads the table's CDC TTL (24h by default) and backdates the connector's 'start time'?
That would be my understanding.
Yes. The connector assumes there might be data as old as the CDC TTL.
@haaawk @hartmut-co-uk […]
It might happen that we do the queries anyway even though they return nothing. We should optimize this case. @avelanarius, could you please look into this?
This can be done easily, but it will require the user not to send any traffic until the connector is running. Another option is to specify in the config how far back the connector should start.
This may be problematic if there's a lot of data accumulated. I guess paging would do the trick, so we could try this as well.
Regarding the window configuration: I can see the original post uses […], whereas I was testing/evaluating even more aggressively with e.g. […], trying to get this latency down. Once the connector is running and caught up it seemed somewhat stable and OK, but, as mentioned, when it is started anew or after some downtime there is a huge spike in utilisation (even if there's literally no data in the CDC log table...). I guess what I'm trying to say is that configuring a small window size (e.g. 2s) results in a massive number of queries...
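A back-of-the-envelope illustration of that multiplier, combining the 24h CDC TTL discussed above with the 2s window mentioned here (the numbers are illustrative, not measurements):

```python
# Rough catch-up multiplier when the connector starts from scratch.
cdc_ttl_seconds = 24 * 60 * 60   # default CDC TTL discussed above
window_seconds = 2               # small window size mentioned above

windows_to_catch_up = cdc_ttl_seconds // window_seconds
print(windows_to_catch_up)       # 43200 windows to walk through

# Each of those windows costs at least one SELECT per group of stream ids,
# even when the CDC log contains no data at all.
```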
At the moment, due to the eventually consistent nature of Scylla, we need to keep the confidence window big enough to allow delayed writes to reach the replicas before we query the window that contains those writes. It is not recommended to reduce the confidence window size, as it can lead to some updates being missed. In the future, with consistent Raft-based tables, we will be able to relax this restriction.
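As a concrete illustration (with made-up numbers): a write stamped 10:00:00.000 by the client may only reach its last replica at 10:00:00.800 because of network delay or a momentarily overloaded node. If the connector had already read the window ending at 10:00:00.500 shortly after that time, the delayed row would be missed forever. Keeping the confidence window comfortably larger than the worst expected replication delay means a window is only queried after all of its writes have settled on the replicas.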
🙇 Oh! Great to hear raft will also be beneficial for this use case. |
I'm not sure. Raft is being developed by another team and I'm not even sure an official release date has been announced yet.
added #17 |
Raft is enabled by default (for schema changes) in 5.2. Consistent tables are still in the backlog, but we are moving forward (after migrating topology changes and so on to Raft).
In scylla-cdc-source-connector, the following query pattern is used to retrieve the changed data from the CDC table.
"cdc$stream_id" IN ?
This query pattern seems to send lookup requests to all the ScyllaDB cluster nodes, so ScyllaDB sees spikes in load and crashes.
How about improving this query pattern by issuing the lookups in parallel and merging the results inside the CDC connector?
Currently, ScyllaDB has failed after attaching the Source Connector in our production environment.
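For context, the query pattern quoted above has roughly the following full shape. This is a sketch with an assumed table name and contact point, not the connector's actual code, which prepares its own statement.

```python
# Sketch of the windowed CDC read using the grouped IN pattern quoted above.
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect()  # placeholder contact point

grouped_stmt = session.prepare(
    'SELECT * FROM my_keyspace.my_table_scylla_cdc_log '
    'WHERE "cdc$stream_id" IN ? AND "cdc$time" >= ? AND "cdc$time" < ?'
)

def read_window(stream_ids, window_start, window_end):
    # One SELECT covers a whole group of stream ids for the given time window.
    return list(session.execute(grouped_stmt, (stream_ids, window_start, window_end)))
```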
[ScyllaDB Cluster Environment]
[Scylla CDC Source Connector Configuration]