# Scaling Cloud Controller

## API

The main Cloud Controller API job is the `cloud_controller_ng` Ruby server process.
Signs that the API needs to be scaled:

- `cc.requests.outstanding` is at or consistently near 20
- `system.cpu.user` is above 0.85 utilization of a single core on the API VM (see note on BOSH CPU values)
- `cc.vitals.cpu_load_avg` is 1 or higher
- `cc.vitals.uptime` is consistently low, indicating frequent restarts (possibly due to memory pressure)
- Average response latency is elevated
- Degraded web UI responsiveness or timeouts
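These vitals can be spot-checked with the BOSH CLI as well as through the firehose. A minimal sketch, assuming cf-deployment conventions (deployment named `cf`, instance group named `api`):

```bash
# Show per-instance vitals (CPU user/sys/wait, load average, memory, disk)
# for the API VMs. Deployment name "cf" and instance group "api" are
# cf-deployment conventions; adjust to your environment.
bosh -d cf instances --vitals | grep -E '^api/'
```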
Before and after scaling Cloud Controller API VMs, it's important to verify that the CC's database is not overloaded. All Cloud Controller processes are backed by the same database (CCDB), so heavy load on the database will degrade API performance regardless of how many Cloud Controllers are deployed.

In CF deployments with internal MySQL clusters, a MySQL database VM with CPU usage over 75% can be considered overloaded. When this happens, the MySQL VMs must be scaled up, or the added load of additional Cloud Controllers will exacerbate the problem.
Cloud Controller API VMs should primarily be scaled horizontally. Adding cores to a single VM is not helpful because of Ruby's Global Interpreter Lock (GIL), which limits the `cloud_controller_ng` process to effectively using a single CPU core on a multi-core machine.
If the Cloud Controller is appropriately provisioned in terms of CPU but performance is still poor, revisit database load before adding more API VMs.
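A minimal scale-out sketch, again assuming cf-deployment conventions (an `api` instance group and a base manifest named `cf-deployment.yml`):

```bash
# Ops file that sets the api instance group to 4 instances.
# Group name "api" and deployment name "cf" are cf-deployment
# conventions; verify against your manifest.
cat > scale-api.yml <<'EOF'
- type: replace
  path: /instance_groups/name=api/instances
  value: 4
EOF

bosh -d cf deploy cf-deployment.yml -o scale-api.yml
```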
Colloquially known as "local workers," this job is primarily responsible for handling app package bits uploaded to the API VMs during cf push
.
Signs that the local workers need to be scaled:

- `cf push` is intermittently failing
- `cf push` average time is elevated
- `cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>` (i.e. `cc.job_queue_length.cc-api-<VM_INDEX>`) is continuously growing
- `cc.job_queue_length.total` is continuously growing
Because these are colocated with the Cloud Controller API job, they should be scaled horizontally with the API.
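You can confirm the colocation by listing the monit-supervised processes on an API VM. A minimal sketch, assuming BOSH SSH access and cf-deployment naming:

```bash
# List processes monit supervises on an API VM. The local workers show up
# as cloud_controller_worker_local_<N> alongside cloud_controller_ng.
# Deployment "cf" and instance group "api" are assumptions.
bosh -d cf ssh api/0 -c 'sudo /var/vcap/bosh/bin/monit summary'
```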
Colloquially known as "workers", this job (and VM) is responsible for handling asynchronous work, batch deletes, and other things scheduled by the clock.
Signs that the generic workers need to be scaled:

- `cf delete-org ORG_NAME` appears to leave its contained resources around for a long time
- Users report slow deletes for other resources
- cf-acceptance-tests succeed generally, but fail during cleanup
- `cc.job_queue_length.cc-<VM_TYPE>-<VM_INDEX>` (i.e. `cc.job_queue_length.cc-cc-worker-<VM_INDEX>`) is continuously growing
- `cc.job_queue_length.total` is continuously growing
cc-worker can safely scale horizontally in all deployments, but if your worker VMs have CPU/memory headroom you can also increase `cc.jobs.generic.number_of_workers` to raise the number of worker processes on each cc-worker VM.
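A minimal sketch combining both knobs, assuming cf-deployment naming (`cc-worker` instance group, `cloud_controller_worker` job); the property name comes from this page, while the ops-file paths are assumptions to verify against your manifest:

```bash
# Scale out cc-worker VMs and raise worker processes per VM in one deploy.
# The "?" marks the properties path as optional so go-patch creates it
# if it is not already set in the manifest.
cat > scale-cc-worker.yml <<'EOF'
- type: replace
  path: /instance_groups/name=cc-worker/instances
  value: 4
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc?/jobs/generic/number_of_workers
  value: 2
EOF

bosh -d cf deploy cf-deployment.yml -o scale-cc-worker.yml
```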
## Clock and Deployment Updater

The clock runs the Diego sync process and schedules periodic background jobs. The deployment updater scales up new processes during v3 rolling deployments and scales down old ones.
Signs that the clock or deployment updater is unhealthy:

- Diego domains are frequently unfresh
- Your LRP count is larger than your process instance count
- Deployments are slow to increase and decrease instance counts
- `cc.Diego_sync.duration` is continuously increasing over time
- `system.cpu.user` is above 75% on the scheduler VM
Both of these jobs are singletons (only a single instance is active), so extra instances are for failover HA rather than scalability. Performance issues are likely due to database overloading or greedy neighbors on the scheduler VM.
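Domain freshness can be checked directly with `cfdot`. A minimal sketch, assuming `cfdot` is available on the `diego-api` VM as in cf-deployment:

```bash
# Print the currently fresh Diego domains. Healthy output includes
# "cf-apps" and "cf-tasks"; a missing entry means that domain is unfresh.
# Instance group "diego-api" is a cf-deployment convention.
bosh -d cf ssh diego-api/0 -c 'cfdot domains'
```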
## Summary

| Job Name | What It Does | Relevant Metrics | How to Scale |
| --- | --- | --- | --- |
| `cloud_controller_ng` | API | `cc.requests.outstanding`, `system.cpu.user`, database performance | Horizontally (scale out). |
| `cloud_controller_worker_local_#` | File uploads (packages, droplets, buildpacks); in other words, these handle `cf push`. | `cf push` times/failures, `cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>`, `cc.job_queue_length.total` | Horizontally scale API VMs. The property that controls the number of local workers per API VM is not configurable in all deployments. |
| `cloud_controller_clock` | Runs the Diego sync process; schedules periodic background jobs. | `cc.Diego_sync.duration`, `bbs.Domain.cf-apps`, `bbs.Domain.cf-tasks` | Singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| `cloud_controller_worker_#` | Runs background jobs such as delete operations and jobs scheduled by the `cloud_controller_clock`. | `cc.job_queue_length.cc-generic`, `cc.job_queue_length.total` | Horizontally scale worker VMs. The number of worker processes per VM is not configurable in all deployments, but can be configured via capi-release. |
| `cc_deployment_updater` | Handles rolling app deployments. | `cc.deployments.update.duration`, `cc.deployments.deploying` | Singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| `tps-watcher` | Watches for app crash events from Diego and informs the Cloud Controller so it can create crash events. | None | Singleton, so horizontal scaling is for HA, not performance. Typically not scaled. |
| `cc-uploader` | Proxies staged droplets from Diego to the Cloud Controller. | None | Already scaled with API VMs. |
| `blobstore_nginx` | Internal WebDAV blobstore. | `cf push` times/failures, `system.cpu.user` | Singleton and not HA. Can be scaled vertically under high CPU load; it does a lot of filesystem operations, so disk latency can impact performance. |
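Vertical scaling of the blobstore is a `vm_type` change. A minimal sketch, assuming cf-deployment's `singleton-blobstore` instance group and a `large` vm_type that exists in your cloud config (both assumptions):

```bash
# Move the blobstore VM to a larger vm_type from your cloud config.
# "singleton-blobstore" is the cf-deployment group name; "large" is a
# placeholder vm_type that must exist in your cloud config.
cat > scale-up-blobstore.yml <<'EOF'
- type: replace
  path: /instance_groups/name=singleton-blobstore/vm_type
  value: large
EOF

bosh -d cf deploy cf-deployment.yml -o scale-up-blobstore.yml
```

Because the blobstore is a singleton, expect a brief outage while BOSH recreates the VM; the persistent disk holding the blobstore data is reattached to the new VM.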