Theory about improving user node's utilization #3309

Closed
consideRatio opened this issue Oct 24, 2023 · 0 comments
Labels: allocation:internal-eng, tech:cloud-infra

I ended up thinking about improving user nodes' utilization, and arrived at some thoughts I figured were worth writing down.

Let's open dedicated issues for any related action points we come up with, so this issue can stay focused on the theory.

Too high remainder capacity

The "remainder capacity" is the unscheduled capacity of a node pools current nodes with no scheduled users. A node pool's remainder capacity is likely the biggest cost driver for wasted capacity, and increases with larger node sizes.

Reducing the remainder capacity can be done by:

  1. using smaller node sizes
  2. decreasing segregation of users (for example, reducing the number of community-specific node pools)
  3. making users' session durations shorter through more aggressive culling
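As a minimal sketch of how remainder capacity adds up and why smaller nodes help, assuming made-up node and request sizes (none of the numbers below come from a real node pool):

```python
import math

# Hypothetical sizes only: user servers request 8 GiB each; compare a pool of
# large 104 GiB nodes against a pool of small 52 GiB nodes.
REQUEST_GIB = 8

def remainder_gib(node_gib: int, users: int) -> int:
    """Unrequested capacity across the fewest nodes that fit the users."""
    users_per_node = node_gib // REQUEST_GIB
    nodes = max(1, math.ceil(users / users_per_node))
    return nodes * node_gib - users * REQUEST_GIB

for users in (1, 14, 20):
    print(users, remainder_gib(104, users), remainder_gib(52, users))
# 1 user:   96 GiB idle on large nodes vs 44 GiB on small nodes
# 14 users: 96 vs 44
# 20 users: 48 vs 48 (both pools happen to pack evenly here)
```

The smaller node size doesn't always win, but it caps how much capacity the last, partially filled node can leave idle.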

Badly tuned resource requests/limits

Badly tuned resource requests/limits will drive costs.

Memory requests/limits are the most complicated to tune, because when a node runs out of memory, the user server exceeding its memory request by the largest relative amount gets terminated.
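A rough sketch of that termination rule, with made-up servers (real kubelet OOM/eviction handling weighs more factors, so treat this purely as an illustration of "largest relative excess over request"):

```python
# Made-up servers: which one is first in line for termination when the node
# runs out of memory, under the "largest relative excess" rule above?
servers = {
    "user-a": {"request_gib": 2.0, "usage_gib": 3.5},  # 1.75x its request
    "user-b": {"request_gib": 4.0, "usage_gib": 5.0},  # 1.25x
    "user-c": {"request_gib": 2.0, "usage_gib": 1.5},  # 0.75x
}

victim = max(servers, key=lambda n: servers[n]["usage_gib"] / servers[n]["request_gib"])
print(victim)  # user-a, despite user-b using more memory in absolute terms
```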

  1. Requesting more memory than is used at any given time.
    If the request exceeds the server's maximum use, it is trivially too large a request.

  2. Not oversubscribing well enough.
    Exemplified with memory, this means requesting memory too close to the memory limit and too far above the memory used on average. The extreme case is setting the requests equal to the limit.

    To avoid running out of memory on a node, user servers must request at least as much memory as their average use; otherwise a node fully scheduled based on requests has an expected memory use above its capacity, and is then mathematically guaranteed to run out of memory at least some of the time.

    Requests should therefore be set somewhere between the user server's average use and maximum use. With more users per node, it becomes safer to set requests closer to the average use (see the first sketch after this list).

  3. Causing a significant remainder of unscheduled capacity.
    Requests should pack well onto nodes, leaving little unscheduled capacity. This fails when requesting, for example, 51% or 26% of an available resource: a node then only fits 1 or 3 users respectively, instead of the more appropriate 2 or 4, leaving 49% or 22% of its capacity unscheduled (see the second sketch after this list).
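To make point 2 concrete, a sketch with hypothetical numbers (a 100 GiB node and servers averaging 5 GiB of actual use; neither figure is from this issue):

```python
# Hypothetical: servers average 5 GiB of real use on a 100 GiB node. Pack the
# node full based on requests and compare expected use against capacity.
NODE_GIB = 100
AVERAGE_USE_GIB = 5.0

for request_gib in (4.0, 5.0, 6.0):
    fits = int(NODE_GIB / request_gib)     # servers the scheduler will place
    expected_use = fits * AVERAGE_USE_GIB  # expected actual consumption
    print(f"request {request_gib:.0f} GiB: {fits} servers, "
          f"expected use {expected_use:.0f}/{NODE_GIB} GiB")
# request 4 GiB: 25 servers, expected use 125/100 GiB  -> guaranteed trouble
# request 5 GiB: 20 servers, expected use 100/100 GiB  -> borderline
# request 6 GiB: 16 servers, expected use  80/100 GiB  -> headroom
```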
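And for point 3, the 51%/26% packing arithmetic spelled out:

```python
# Requests expressed as a fraction of a node's allocatable resource: how many
# servers fit on one node, and how much capacity is left unscheduled.
def packing(request_fraction: float) -> tuple[int, float]:
    fits = int(1 / request_fraction)          # servers that fit on one node
    return fits, 1 - fits * request_fraction  # fraction left unscheduled

for r in (0.50, 0.51, 0.25, 0.26):
    fits, idle = packing(r)
    print(f"request {r:.0%}: fits {fits} server(s), {idle:.0%} unscheduled")
# request 50%: fits 2 server(s), 0% unscheduled
# request 51%: fits 1 server(s), 49% unscheduled
# request 25%: fits 4 server(s), 0% unscheduled
# request 26%: fits 3 server(s), 22% unscheduled
```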
