Scaling Cloud Controller
The main Cloud Controller API job is the cloud_controller_ng Ruby server process.
- cc.requests.outstanding is at or consistently near 20
- system.cpu.user is above 0.85 utilization of a single core on the api vm (see note on bosh cpu values)
- cc.vitals.cpu_load_avg is 1 or higher
- cc.vitals.uptime is consistently low indicating frequent restarts (possibly due to memory pressure)
- Increased average response latency
- Degraded web UI responsiveness or timeouts
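As a rough illustration, the metric thresholds listed above could be checked against a single sample from one API VM as follows. This is only a sketch: the `ApiVmMetrics` type, the `scaling_signals` function, the margin used to interpret "near 20", and the one-hour uptime cutoff are all assumptions made for the example and are not part of Cloud Controller. Judging whether a value is "consistently" high or low would also require looking at a window of samples rather than one.

```python
from dataclasses import dataclass


@dataclass
class ApiVmMetrics:
    requests_outstanding: float  # cc.requests.outstanding
    cpu_user: float              # system.cpu.user, as a fraction of one core
    cpu_load_avg: float          # cc.vitals.cpu_load_avg
    uptime_seconds: float        # cc.vitals.uptime


OUTSTANDING_LIMIT = 20      # from the list above
NEAR_LIMIT_MARGIN = 2       # assumption: "near 20" is read as 18 or more
CPU_USER_LIMIT = 0.85       # from the list above
LOAD_AVG_LIMIT = 1.0        # from the list above
MIN_HEALTHY_UPTIME = 3600   # assumption: under one hour counts as "low"


def scaling_signals(m: ApiVmMetrics) -> list[str]:
    """Return the names of the metrics that suggest scaling the API VMs."""
    signals = []
    if m.requests_outstanding >= OUTSTANDING_LIMIT - NEAR_LIMIT_MARGIN:
        signals.append("cc.requests.outstanding")
    if m.cpu_user > CPU_USER_LIMIT:
        signals.append("system.cpu.user")
    if m.cpu_load_avg >= LOAD_AVG_LIMIT:
        signals.append("cc.vitals.cpu_load_avg")
    if m.uptime_seconds < MIN_HEALTHY_UPTIME:
        signals.append("cc.vitals.uptime")
    return signals
```

An empty result means none of the thresholds were crossed for that sample; any non-empty result is a prompt to investigate further, not an automatic decision to scale.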
Before and after scaling Cloud Controller API VMs, it's important to verify that the Cloud Controller database (ccdb) is not overloaded. All Cloud Controller processes are backed by the same database, so heavy load on the database will impact API performance regardless of the number of Cloud Controllers deployed.
In CF deployments with internal MySQL clusters, a single MySQL database VM with CPU usage over 75% can be considered overloaded. When this happens, the MySQL VMs must be scaled up first; otherwise the added load of additional Cloud Controllers will exacerbate the problem.
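The database check described above can be sketched as a simple gate that runs before any Cloud Controllers are added. The function names and the shape of the input (a mapping from a VM label to its CPU usage as a fraction) are hypothetical; only the 75% threshold comes from the text.

```python
MYSQL_CPU_LIMIT = 0.75  # from the text: over 75% CPU counts as overloaded


def overloaded_mysql_vms(cpu_by_vm: dict[str, float]) -> list[str]:
    """Return the MySQL VMs whose CPU usage exceeds the limit."""
    return sorted(vm for vm, cpu in cpu_by_vm.items() if cpu > MYSQL_CPU_LIMIT)


def safe_to_add_cloud_controllers(cpu_by_vm: dict[str, float]) -> bool:
    """One overloaded VM is enough to block adding Cloud Controllers."""
    return not overloaded_mysql_vms(cpu_by_vm)
```

For example, with one VM at 80% and another at 40%, the first function would report the busier VM and the second would return `False`, signalling that the database should be scaled up first.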
Cloud Controller API VMs should primarily be scaled horizontally. Adding cores to a single VM does not help because Ruby's Global Interpreter Lock (GIL) allows only one thread to execute Ruby code at a time, so the cloud_controller_ng process can effectively use only a single CPU core on a multi-core machine.
If the Cloud Controller is appropriately provisioned in terms of CPU
Colloquially known as "local workers," this job is primarily responsible for handling app package bits uploaded to the API VMs during cf push.
- cf push success rate
- cf push average time
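If push outcomes are being recorded somewhere (for example by a smoke test that pushes an app periodically), the two indicators above could be computed along these lines. The `PushResult` record and both functions are hypothetical, and counting only successful pushes toward the average time is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PushResult:
    succeeded: bool
    duration_seconds: float


def push_success_rate(results: list[PushResult]) -> Optional[float]:
    """Fraction of pushes that succeeded, or None if there is no data."""
    if not results:
        return None
    return sum(1 for r in results if r.succeeded) / len(results)


def push_average_time(results: list[PushResult]) -> Optional[float]:
    """Mean duration of successful pushes (assumption: failures excluded)."""
    durations = [r.duration_seconds for r in results if r.succeeded]
    if not durations:
        return None
    return sum(durations) / len(durations)
```

A falling success rate or a rising average time, tracked over days or weeks, is the kind of trend these indicators are meant to surface.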
| Job Name | What It Does | Relevant Metrics | How to Scale |
| --- | --- | --- | --- |
| cloud_controller_ng | API | cc.requests.outstanding; system.cpu.user; Database performance | Horizontally - Scale Out |
| cloud_controller_worker_local_# | File uploads (packages, droplets, buildpacks). In other words, these handle cf push. | cf push times/failures; cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>; cc.job_queue_length.total | Horizontally scale API vms. Property that controls # of local workers per API vm is not configurable in all deployments (maybe it should be?) |
| cloud_controller_clock | Runs Diego sync process. Schedules periodic background jobs. | cc.Diego_sync.duration; bbs.Domain.cf-apps; bbs.Domain.cf-tasks | It is a singleton, so horizontally scaling is for HA and not for performance. Performance issues are likely due to database overloading. |
| cloud_controller_worker_# | Runs background jobs such as delete operations and jobs scheduled by the cloud_controller_clock. | cc.job_queue_length.cc-generic; cc.job_queue_length.total | Horizontally scale worker vms. Number of worker processes per vm is not configurable in all deployments, but can be configured via capi-release. |
| cc_deployment_updater | Handles rolling app deployments | cc.deployments.update.duration; cc.deployments.deploying | It is a singleton, so horizontally scaling is for HA and not for performance. Performance issues are likely due to database overloading. |
| tps-watcher | Looks for app crash events from Diego and informs Cloud Controller of them to create crash events | None | It is a singleton, so horizontally scaling is for HA and not for performance. Typically not scaled. |
| cc-uploader | Proxies staged droplets from Diego to Cloud Controller | None | Already scaled with API vms |
| blobstore_nginx | Internal WebDav blobstore | cf push times/failures; system.cpu.user | It is a singleton and is not HA. Can be scaled vertically if under high CPU load. Does a lot of filesystem operations so disk latency can impact performance. |