
Scaling Cloud Controller

braa braa braa edited this page Mar 21, 2019 · 8 revisions

UNDER CONSTRUCTION


Jobs [cf-deployment instance group]

cloud_controller_ng [api]

The main Cloud Controller API job is the cloud_controller_ng Ruby server process.

When to scale

Key Metrics
  • cc.requests.outstanding is at or consistently near 20
  • system.cpu.user is above 0.85 utilization of a single core on the api vm (see note on bosh cpu values)
  • cc.vitals.cpu_load_avg is 1 or higher
  • cc.vitals.uptime is consistently low indicating frequent restarts (possibly due to memory pressure)
Other Heuristics
  • Average response latency
  • Degraded web UI responsiveness or timeouts

How to scale

Before and after scaling Cloud Controller API VMs, it's important to verify that the CC's database is not overloaded. All Cloud Controller processes are backed by the same database (ccdb), so heavy load on the database will degrade API performance regardless of how many Cloud Controllers are deployed.

In CF deployments with internal MySQL clusters, a MySQL database VM with CPU usage over 75% can be considered overloaded. When this happens, the MySQL VMs must be scaled up first; otherwise the added load from additional Cloud Controllers will exacerbate the problem.

Cloud Controller API VMs should primarily be scaled horizontally. Scaling up the number of cores on a single VM does not help because Ruby's Global Interpreter Lock (GIL) limits the cloud_controller_ng process to effectively using a single CPU core on a multi-core machine.
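In cf-deployment, horizontal scaling is typically expressed as an operations file applied at deploy time. A minimal sketch, assuming the instance group is named api as in stock cf-deployment (the filename and instance count are illustrative):

```yaml
# scale-api.yml (hypothetical ops file): raise the api instance group to 4 VMs.
# Apply with: bosh -d cf deploy cf.yml -o scale-api.yml
- type: replace
  path: /instance_groups/name=api/instances
  value: 4
```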

If the Cloud Controller API VMs are appropriately provisioned in terms of CPU but performance problems persist, revisit the database before adding more API VMs, since ccdb is the shared bottleneck for all Cloud Controller processes.

cloud_controller_worker_local [api]

Colloquially known as "local workers," this job is primarily responsible for handling app package bits uploaded to the API VMs during cf push.

Heuristics
  • cf push is intermittently failing
  • cf push average time is elevated
Metrics
  • cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX> (e.g. cc.job_queue_length.cc-api-<VM_INDEX>) is continuously growing
  • cc.job_queue_length.total is continuously growing

How to scale

Because these are colocated with the Cloud Controller API job, they should be scaled horizontally with the API.

cloud_controller_worker [cc-worker]

Colloquially known as "workers," this job (and its dedicated VM) is responsible for handling asynchronous work: batch deletes and other jobs scheduled by the clock.

Heuristics
  • cf delete-org ORG_NAME appears to leave its contained resources around for a long time
  • Users report slow deletes for other resources
  • cf-acceptance-tests succeed generally, but fail during cleanup
Metrics
  • cc.job_queue_length.cc-<VM_TYPE>-<VM_INDEX> (e.g. cc.job_queue_length.cc-cc-worker-<VM_INDEX>) is continuously growing
  • cc.job_queue_length.total is continuously growing

How to scale

cc-worker can safely scale horizontally in all deployments, but if your worker VMs have CPU/memory headroom you can also increase cc.jobs.generic.number_of_workers to increase the number of worker processes on each cc-worker VM.
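Both knobs can be turned in one operations file. A sketch under the same cf-deployment assumptions (instance group cc-worker, job cloud_controller_worker from capi-release; the counts are illustrative):

```yaml
# Hypothetical ops file: add cc-worker VMs and run more worker processes on each.
- type: replace
  path: /instance_groups/name=cc-worker/instances
  value: 3
# Only worthwhile if the existing worker VMs have CPU/memory headroom.
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs/generic/number_of_workers
  value: 2
```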

cloud_controller_clock and cc_deployment_updater [scheduler]

The clock runs the Diego sync process and schedules periodic background jobs. The deployment updater handles v3 rolling deployments, scaling up new processes and scaling down old ones.

Heuristics
  • Diego domains are frequently unfresh
  • Your LRP count is larger than your process instance count
  • Deployments are slow to increase and decrease instance count
Metrics
  • cc.Diego_sync.duration is continuously increasing over time
  • system.cpu.user is above 75% on the scheduler VM

How to scale

Both of these jobs are singletons (only a single instance is active at a time), so extra instances provide failover HA rather than additional capacity. Performance issues are more likely due to database overloading or noisy neighbors colocated on the scheduler VM.
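Because only one instance is active, the only scaling change that makes sense is adding a standby for failover. A sketch, assuming the cf-deployment instance group is named scheduler:

```yaml
# Hypothetical ops file: run a second scheduler VM purely for failover.
# This adds availability, not throughput.
- type: replace
  path: /instance_groups/name=scheduler/instances
  value: 2
```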

blobstore_nginx

The internal WebDAV blobstore. It is a singleton and is not HA. It can be scaled vertically if it is under high CPU load. It performs many filesystem operations, so disk latency can also impact its performance. Watch cf push times/failures and system.cpu.user on the blobstore VM.

Summary

| Job Name | What it does | Relevant Metrics | How to Scale |
|---|---|---|---|
| cloud_controller_ng | API | cc.requests.outstanding, system.cpu.user, database performance | Horizontally (scale out) |
| cloud_controller_worker_local_# | File uploads (packages, droplets, buildpacks). In other words, these handle cf push. | cf push times/failures, cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>, cc.job_queue_length.total | Horizontally scale API VMs. The property that controls the number of local workers per API VM is not configurable in all deployments (maybe it should be?) |
| cloud_controller_clock | Runs the Diego sync process. Schedules periodic background jobs. | cc.Diego_sync.duration, bbs.Domain.cf-apps, bbs.Domain.cf-tasks | It is a singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| cloud_controller_worker_# | Runs background jobs such as delete operations and jobs scheduled by the cloud_controller_clock. | cc.job_queue_length.cc-generic, cc.job_queue_length.total | Horizontally scale worker VMs. The number of worker processes per VM is not configurable in all deployments, but can be configured via capi-release. |
| cc_deployment_updater | Handles rolling app deployments | cc.deployments.update.duration, cc.deployments.deploying | It is a singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| tps-watcher | Watches for app crash events from Diego and informs Cloud Controller so it can create crash events | None | It is a singleton, so horizontal scaling is for HA, not performance. Typically not scaled. |
| cc-uploader | Proxies staged droplets from Diego to Cloud Controller | None | Already scaled with API VMs |
| blobstore_nginx | Internal WebDAV blobstore | cf push times/failures, system.cpu.user | It is a singleton and is not HA. Can be scaled vertically if under high CPU load. Does many filesystem operations, so disk latency can impact performance. |
