
Scaling Cloud Controller

braa braa braa edited this page Mar 21, 2019 · 8 revisions

UNDER CONSTRUCTION


Jobs [cf-deployment instance group]

cloud_controller_ng [api]

The main Cloud Controller API job is the cloud_controller_ng Ruby server process.

When to scale

Key Metrics
  • cc.requests.outstanding is at or consistently near 20
  • system.cpu.user is above 0.85 utilization of a single core on the api vm (see note on bosh cpu values)
  • cc.vitals.cpu_load_avg is 1 or higher
  • cc.vitals.uptime is consistently low indicating frequent restarts (possibly due to memory pressure)
Other Heuristics
  • Average response latency
  • Degraded web UI responsiveness or timeouts

How to scale

Before and after scaling Cloud Controller API VMs, it's important to verify that the CC's database is not overloaded. All Cloud Controller processes are backed by the same database (ccdb), so heavy load on the database will degrade API performance regardless of how many Cloud Controllers are deployed.

In CF deployments with internal MySQL clusters, a MySQL database VM with CPU usage over 75% can be considered overloaded. When this happens, the MySQL VMs must be scaled up first; otherwise the added load from additional Cloud Controllers will exacerbate the problem.

Cloud Controller API VMs should primarily be scaled horizontally. Scaling up the number of cores on a single VM does not help because Ruby's Global Interpreter Lock (GIL) limits the cloud_controller_ng process to effectively using a single CPU core on a multi-core machine.
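In cf-deployment, horizontal scaling is typically expressed as an operations file applied at deploy time. A minimal sketch, assuming the instance group is named api as in stock cf-deployment (the filename and instance count are illustrative):

```yaml
# scale-api.yml (hypothetical ops file): raise the api instance group to 4 VMs.
# Apply with: bosh -d cf deploy cf.yml -o scale-api.yml
- type: replace
  path: /instance_groups/name=api/instances
  value: 4
```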

If the Cloud Controller API VMs are appropriately provisioned in terms of CPU but performance problems persist, revisit the database before adding more API VMs, since ccdb is the shared bottleneck for all Cloud Controller processes.

cloud_controller_worker_local [api]

Colloquially known as "local workers," this job is primarily responsible for handling app package bits uploaded to the API VMs during cf push.

Heuristics
  • cf push is intermittently failing
  • cf push average time is elevated
Metrics
  • cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX> (e.g. cc.job_queue_length.cc-api-<VM_INDEX>) is continuously growing
  • cc.job_queue_length.total is continuously growing

How to scale

Because these are colocated with the Cloud Controller API job, they should be scaled horizontally with the API.

cloud_controller_worker [cc-worker]

Colloquially known as "workers," this job (and its dedicated VM) is responsible for handling asynchronous work: batch deletes and other jobs scheduled by the clock.

Heuristics
  • cf delete-org ORG_NAME appears to leave its contained resources around for a long time
  • Users report slow deletes for other resources
  • cf-acceptance-tests succeed generally, but fail during cleanup
Metrics
  • cc.job_queue_length.cc-<VM_TYPE>-<VM_INDEX> (e.g. cc.job_queue_length.cc-cc-worker-<VM_INDEX>) is continuously growing
  • cc.job_queue_length.total is continuously growing

How to scale

cc-worker can safely scale horizontally in all deployments, but if your worker VMs have CPU/memory headroom you can also increase cc.jobs.generic.number_of_workers to increase the number of worker processes on each cc-worker VM.
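Both knobs can be turned in one operations file. A sketch under the same cf-deployment assumptions (instance group cc-worker, job cloud_controller_worker from capi-release; the counts are illustrative):

```yaml
# Hypothetical ops file: add cc-worker VMs and run more worker processes on each.
- type: replace
  path: /instance_groups/name=cc-worker/instances
  value: 3
# Only worthwhile if the existing worker VMs have CPU/memory headroom.
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs/generic/number_of_workers
  value: 2
```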

cloud_controller_clock and cc_deployment_updater [scheduler]

The clock runs the Diego sync process and schedules periodic background jobs. The deployment updater handles v3 rolling deployments, scaling up new processes and scaling down old ones.

Heuristics
  • Diego domains are frequently unfresh
  • Your LRP count is larger than your process instance count
  • Deployments are slow to increase and decrease instance count
Metrics
  • cc.Diego_sync.duration is continuously increasing over time
  • system.cpu.user is above 75% on the scheduler VM

How to scale

Both of these jobs are singletons (only a single instance is active at a time), so extra instances provide failover HA rather than additional capacity. Performance issues are more likely due to database overloading or noisy neighbors colocated on the scheduler VM.
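Because only one instance is active, the only scaling change that makes sense is adding a standby for failover. A sketch, assuming the cf-deployment instance group is named scheduler:

```yaml
# Hypothetical ops file: run a second scheduler VM purely for failover.
# This adds availability, not throughput.
- type: replace
  path: /instance_groups/name=scheduler/instances
  value: 2
```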

blobstore_nginx

The internal WebDAV blobstore. It is a singleton and is not HA. It can be scaled vertically if it is under high CPU load. It performs many filesystem operations, so disk latency can also impact its performance. Watch cf push times/failures and system.cpu.user on the blobstore VM.

Summary

| Job Name | What it does | Relevant Metrics | How to Scale |
|---|---|---|---|
| cloud_controller_ng | API | cc.requests.outstanding, system.cpu.user, database performance | Horizontally (scale out) |
| cloud_controller_worker_local_# | File uploads (packages, droplets, buildpacks). In other words, these handle cf push. | cf push times/failures, cc.job_queue_length.cc-<VM_NAME>-<VM_INDEX>, cc.job_queue_length.total | Horizontally scale API VMs. The property that controls the number of local workers per API VM is not configurable in all deployments (maybe it should be?) |
| cloud_controller_clock | Runs the Diego sync process. Schedules periodic background jobs. | cc.Diego_sync.duration, bbs.Domain.cf-apps, bbs.Domain.cf-tasks | It is a singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| cloud_controller_worker_# | Runs background jobs such as delete operations and jobs scheduled by the cloud_controller_clock. | cc.job_queue_length.cc-generic, cc.job_queue_length.total | Horizontally scale worker VMs. The number of worker processes per VM is not configurable in all deployments, but can be configured via capi-release. |
| cc_deployment_updater | Handles rolling app deployments | cc.deployments.update.duration, cc.deployments.deploying | It is a singleton, so horizontal scaling is for HA, not performance. Performance issues are likely due to database overloading. |
| tps-watcher | Watches for app crash events from Diego and informs Cloud Controller so it can create crash events | None | It is a singleton, so horizontal scaling is for HA, not performance. Typically not scaled. |
| cc-uploader | Proxies staged droplets from Diego to Cloud Controller | None | Already scaled with API VMs |
| blobstore_nginx | Internal WebDAV blobstore | cf push times/failures, system.cpu.user | It is a singleton and is not HA. Can be scaled vertically if under high CPU load. Does many filesystem operations, so disk latency can impact performance. |
