-
Notifications
You must be signed in to change notification settings - Fork 601
Build the cluster workload model
Cruise Control uses utilization of four resources to build its workload model.
Resource | Granularity of Raw Metric |
---|---|
CPU | Per broker |
Disk Utilization | Per replica |
Bytes In Rate | All the leader replicas of a topic per broker |
Bytes Out Rate | All the leader replicas of a topic per broker |
Because Cruise Control needs the metric sample granularity to be at replica level to build a cluster model, we derive the raw metric of CPU and network to get the replica level metrics.
The algorithm we use are the following:
Replica_Bytes_In_Rate = Topic_Bytes_In_Rate_On_Broker / Num_Partition_Of_The_Topic_On_Broker
Replica_Bytes_Out_Rate = Topic_Bytes_In_Rate_On_Broker / Num_Partition_Of_The_Topic_On_Broker
CPU utilization is more complicated to derive. We are using a bytes-based load model with coefficients. Cruise Control supports two ways to derive the CPU utilization.
- Heuristic Static Model
- Linear Regression Model
Unlike the linear regression model, heuristic static model does not require any training. It uses predefined weights for leader bytes in rate, leader bytes out rate and follower bytes in rate, respectively.
a - CPU contribution weight for Leader_Bytes_In_Rate
b - CPU contribution weight for Leader_Bytes_Out_Rate
c - CPU contribution weight for Follower_Bytes_In_Rate
Partition_Leader_CPU_Utilization = Broker_CPU_Utilization * (a * Partition_Leader_Bytes_In_Rate + b * Partition_Leader_Bytes_Out_Rate) / (a * Leader_Bytes_In_Rate_On_Broker + b * Leader_Bytes_Out_rate_On_Broker + c * Follower_Bytes_In_Rate_On_Broker)
Partition_Follower_CPU_Utilization = Broker_CPU_Utilization * (c * Partition_Follower_Bytes_In_Rate) / (a * Leader_Bytes_In_Rate_On_Broker + b * Leader_Bytes_Out_rate_On_Broker + c * Follower_Bytes_In_Rate_On_Broker)
The linear regression model requires user to train the model by triggering the training before bootstrapping or filling the samples into the load windows. It demands the training samples to be diverse enough in order to get accurate coefficients.
Broker_CPU_Utilization = a * Leader_Bytes_In_Rate_On_Broker + b * Leader_Bytes_Out_rate_On_Broker + c * Follower_Bytes_In_Rate_On_Broker
Where a, b and c are the coefficients that we will build based on historical data. Notice that a, b and c are heavily depending on the following configurations of the broker:
- Compression Codec
- Whether SSL is enabled or not. So the three coefficient will change based on the configuration.
Once we get the value of a, b and c, the replica cpu utilization can be simply calculated as below:
Leader_Replica_Load = a * Replica_Bytes_In_Rate + b * Replica_Bytes_Out_Rate
Follower_Replica_Load = c * Replica_Bytes_In_Rate
The CPU utilization may be affected by things other than bytes in and out rate such as topic creation / deletion, GC, etc. But typically those operations are taking much less resources and not putting continuous pressure on CPU. The impact of such tasks will be amortized to almost negligible if we look at a long enough period.
Log compaction is another contribution to CPU Utilization, but given that it is also pretty much based on the total bytes produced, by looking at the bytes in rate we have already factored that in.
So far we assumed the CPU utilization is associated only with throughput. This is likely over-simplified. We should collect more metrics to take ProduceRequestRate, ConsumerRequestRate and average request size into consideration. That also means we need to collect per partition request rate as well.
Most of the metrics samples collected fall in a narrow range of CPU utilization (e.g. 40% - 60%), a naive regression will cause the model to be biased due to the heavyweight of the samples in this CPU utilization range, thus cause inaccurate estimation when the CPU utilization falls out of this range.
The solution to this issue is to selectively choose the samples to remove the bias introduced by the uneven sample distribution.
In practice, a Kafka cluster may have a pretty stable BytesIn to BytesOut rate and that may cause the parameter observation matrix to become close to singular and dramatically impact the accuracy of the Linear Regression Model.
The solution to this is to again select the samples to ensure enough representatives of different BytesIn to BytesOut ratio in the samples. Inaccurate Micro Estimation The linear regression model was derived from the broker level metrics. When we want to apply this model to each single partition, it might not be that accurate anymore.
The solution is only apply the model to the broker, i.e. estimate the CPU utilization of the broker after subtract the IO load of the partition, as opposed to computing the CPU utilization of partition first and subtract it from the broker CPU utilization.
Contents
- Cruise Control Wiki
- Overview
- Troubleshooting
- User Guide
- Python Client
- Developer Guide