
Scaling Cloud Controller

Tim Downey edited this page Mar 20, 2019 · 8 revisions

UNDER CONSTRUCTION

Jobs [cf-deployment instance group]

cloud_controller_ng [api]

The main Cloud Controller API job is cloud_controller_ng, a Ruby server process.

When to scale

Key Metrics
  • cc.requests.outstanding is at or consistently near 20
  • system.cpu.user is above 0.85 utilization of a single core on the API VM (see note on BOSH CPU values)
  • cc.vitals.cpu_load_avg is 1 or higher
  • cc.vitals.uptime is consistently low, indicating frequent restarts (possibly due to memory pressure)
Other Heuristics
  • Average response latency
  • Degraded web UI responsiveness or timeouts
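
The key metrics above can be turned into a simple check. The following is a minimal sketch, not part of any Cloud Controller tooling: the function name, the input shape (a dict of recent readings per metric), the 90% cutoff for "near 20", and the one-hour cutoff for "low" uptime are all assumptions made for illustration. Only the 20, 0.85, and 1 thresholds come from the list above.

```python
from statistics import mean

# Thresholds taken from the Key Metrics list.
OUTSTANDING_REQUESTS_LIMIT = 20
CPU_USER_LIMIT = 0.85
LOAD_AVG_LIMIT = 1.0

# Assumptions for illustration only (not from the wiki):
NEAR_FRACTION = 0.9        # how close to 20 counts as "near"
LOW_UPTIME_SECONDS = 3600  # what counts as a "low" uptime


def scaling_signals(samples):
    """Return the reasons to consider scaling the API VMs.

    `samples` maps a metric name to a list of recent readings.
    Metrics with no readings are skipped.
    """
    reasons = []

    outstanding = samples.get("cc.requests.outstanding", [])
    if outstanding and mean(outstanding) >= NEAR_FRACTION * OUTSTANDING_REQUESTS_LIMIT:
        reasons.append("cc.requests.outstanding is at or near 20")

    cpu_user = samples.get("system.cpu.user", [])
    if cpu_user and mean(cpu_user) > CPU_USER_LIMIT:
        reasons.append("system.cpu.user is above 0.85 of a single core")

    load_avg = samples.get("cc.vitals.cpu_load_avg", [])
    if load_avg and mean(load_avg) >= LOAD_AVG_LIMIT:
        reasons.append("cc.vitals.cpu_load_avg is 1 or higher")

    uptime = samples.get("cc.vitals.uptime", [])
    # "Consistently low" is read here as: every recent reading is low.
    if uptime and all(u < LOW_UPTIME_SECONDS for u in uptime):
        reasons.append("cc.vitals.uptime is consistently low")

    return reasons
```

An empty result means none of the key metrics currently point to scaling; the other heuristics (latency, UI responsiveness) still need to be judged by hand.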

How to scale

Before and after scaling Cloud Controller API VMs, it's important to verify that the Cloud Controller database is not overloaded. All Cloud Controller processes are backed by the same database (ccdb), so heavy load on the database will impact API performance regardless of the number of Cloud Controllers deployed.

In CF deployments with internal MySQL clusters, a single MySQL database VM with CPU usage over 75% can be considered overloaded. When this happens, the MySQL VMs must be scaled up first; otherwise the added load from additional Cloud Controllers will exacerbate the problem.

Cloud Controller API VMs should primarily be scaled horizontally. Adding cores to a single VM does not help because of Ruby's Global Interpreter Lock (GIL), which limits the cloud_controller_ng process to effectively using a single CPU core on a multi-core machine.
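
The ordering described above (check the database first, then scale the API VMs out rather than up) can be sketched as a small decision function. This is an illustrative sketch only: the function name, the input shapes, the VM labels in the test data, and the choice to add one API instance at a time are assumptions. The 75% CPU threshold is the one given above for internal MySQL clusters.

```python
# From the text: a MySQL VM over 75% CPU can be considered overloaded.
MYSQL_CPU_OVERLOAD = 0.75


def next_scaling_step(api_needs_capacity, mysql_cpu_by_vm, current_api_instances):
    """Suggest the next scaling action.

    api_needs_capacity:    True if the API key metrics indicate scaling is needed
    mysql_cpu_by_vm:       dict of MySQL VM label -> CPU usage as a 0..1 fraction
    current_api_instances: number of API VMs currently deployed
    """
    overloaded = sorted(
        vm for vm, cpu in mysql_cpu_by_vm.items() if cpu > MYSQL_CPU_OVERLOAD
    )
    if overloaded:
        # More Cloud Controllers would only add load to an overloaded ccdb.
        return {"action": "scale_database", "overloaded_vms": overloaded}

    if api_needs_capacity:
        # Scale out, not up: one process only uses about one core.
        # Adding a single instance per step is an assumption of this sketch.
        return {"action": "scale_api_out", "api_instances": current_api_instances + 1}

    return {"action": "none"}
```

Running the same check again after the change covers the "after scaling" half of the advice, since the new API VMs may themselves push the database over the threshold.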

If the Cloud Controller is appropriately provisioned in terms of CPU

cloud_controller_worker_local [api]

Colloquially known as "local workers," this job is primarily responsible for handling app package bits uploaded to the API VMs during cf push.

Scratch Area

Other Heuristics
  • cf push success rate
  • cf push average time

| Job Name | What It Does | Relevant Metrics | How to Scale |
|---|---|---|---|
| cloud_controller_ng | API | cc.requests.outstanding, system.cpu.user, Database performance | Horizontally - Scale Out |
| cloud_controller_worker_local_# | File uploads (packages, droplets, buildpacks). In other words, these handle cf push. | cf push times/failures, cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>, cc.job_queue_length.total | Horizontally scale API VMs. Property that controls # of local workers per API VM is not configurable in all deployments (maybe it should be?) |
| cloud_controller_clock | Runs Diego sync process. Schedules periodic background jobs. | cc.Diego_sync.duration, bbs.Domain.cf-apps, bbs.Domain.cf-tasks | It is a singleton, so horizontally scaling is for HA and not for performance. Performance issues are likely due to database overloading. |
| cloud_controller_worker_# | Runs background jobs such as delete operations and jobs scheduled by the cloud_controller_clock. | cc.job_queue_length.cc-generic, cc.job_queue_length.total | Horizontally scale worker VMs. Number of worker processes per VM is not configurable in all deployments, but can be configured via capi-release. |
| cc_deployment_updater | Handles rolling app deployments | cc.deployments.update.duration, cc.deployments.deploying | It is a singleton, so horizontally scaling is for HA and not for performance. Performance issues are likely due to database overloading. |
| tps-watcher | Looks for app crash events from Diego and informs Cloud Controller of them to create crash events | None | It is a singleton, so horizontally scaling is for HA and not for performance. Typically not scaled. |
| cc-uploader | Proxies staged droplets from Diego to Cloud Controller | None | Already scaled with API VMs |
| blobstore_nginx | Internal WebDav blobstore | cf push times/failures, system.cpu.user | It is a singleton and is not HA. Can be scaled vertically if under high CPU load. Does a lot of filesystem operations so disk latency can impact performance. |
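
The per-VM local worker queue metric listed above follows the pattern cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>. As a rough sketch of how those names could be built and checked, the helpers below are hypothetical: the function names, the readings dict, and the `max_length` cutoff are assumptions, since this page gives no threshold for queue length.

```python
def local_worker_queue_metric(vm_name, vm_index):
    """Build the per-VM queue metric name from the pattern in the table."""
    return f"cc.job_queue_length.cc-{vm_name}-{vm_index}"


def backed_up_queues(readings, vm_name, vm_count, max_length):
    """Return the per-VM queue metrics whose latest reading exceeds max_length.

    readings:   dict of metric name -> latest queue length
    vm_name:    VM name as it appears in the metric (deployment specific)
    vm_count:   number of API VMs, indexed from 0
    max_length: cutoff chosen by the operator (an assumption of this sketch)
    """
    names = [local_worker_queue_metric(vm_name, i) for i in range(vm_count)]
    # A metric that was not reported is treated as an empty queue.
    return [name for name in names if readings.get(name, 0) > max_length]
```

A queue that stays backed up on every API VM points toward scaling the API VMs horizontally, in line with the table's guidance for local workers.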
