Theory about improving user node's utilization #3309

Closed
consideRatio opened this issue Oct 24, 2023 · 0 comments
Labels: allocation:internal-eng, tech:cloud-infra

I ended up thinking about improving user nodes' utilization, and arrived at some thoughts I figured were worth writing down.

Let's open dedicated issues for any related action points we come up with, so this issue can stay focused on the theory.

Too high remainder capacity

The "remainder capacity" is the unscheduled capacity of a node pools current nodes with no scheduled users. A node pool's remainder capacity is likely the biggest cost driver for wasted capacity, and increases with larger node sizes.

Reducing the remainder capacity can be done by:

  1. using smaller node sizes
  2. decreasing segregation of users (for example, reducing the number of community-specific node pools)
  3. making users' session durations shorter through more aggressive culling
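As a minimal sketch of how remainder capacity adds up and why smaller nodes help, assuming made-up node and request sizes (none of the numbers below come from a real node pool):

```python
import math

# Hypothetical sizes only: user servers request 8 GiB each; compare a pool of
# large 104 GiB nodes against a pool of small 52 GiB nodes.
REQUEST_GIB = 8

def remainder_gib(node_gib: int, users: int) -> int:
    """Unrequested capacity across the fewest nodes that fit the users."""
    users_per_node = node_gib // REQUEST_GIB
    nodes = max(1, math.ceil(users / users_per_node))
    return nodes * node_gib - users * REQUEST_GIB

for users in (1, 14, 20):
    print(users, remainder_gib(104, users), remainder_gib(52, users))
# 1 user:   96 GiB idle on large nodes vs 44 GiB on small nodes
# 14 users: 96 vs 44
# 20 users: 48 vs 48 (both pools happen to pack evenly here)
```

The smaller node size doesn't always win, but it caps how much capacity the last, partially filled node can leave idle.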

Badly tuned resource requests/limits

Badly tuned resource requests/limits will drive costs.

Memory requests/limits are the most complicated to tune, because when a node runs out of memory, the user server exceeding its memory request by the largest relative amount gets terminated.
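A rough sketch of that termination rule, with made-up servers (real kubelet OOM/eviction handling weighs more factors, so treat this purely as an illustration of "largest relative excess over request"):

```python
# Made-up servers: which one is first in line for termination when the node
# runs out of memory, under the "largest relative excess" rule above?
servers = {
    "user-a": {"request_gib": 2.0, "usage_gib": 3.5},  # 1.75x its request
    "user-b": {"request_gib": 4.0, "usage_gib": 5.0},  # 1.25x
    "user-c": {"request_gib": 2.0, "usage_gib": 1.5},  # 0.75x
}

victim = max(servers, key=lambda n: servers[n]["usage_gib"] / servers[n]["request_gib"])
print(victim)  # user-a, despite user-b using more memory in absolute terms
```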

  1. Requesting more memory than is used at any given time.
    If the request exceeds the server's maximum use, it is trivially too large a request.

  2. Not oversubscribing well enough.
    Exemplified with memory, this means requesting memory too close to the memory limit and too far above the memory used on average. The extreme case is setting the requests equal to the limit.

    To avoid running out of memory on a node, user servers must request at least as much memory as their average use; otherwise a node fully scheduled based on requests has an expected memory use above its capacity, and is then mathematically guaranteed to run out of memory at least some of the time.

    Requests should therefore be set somewhere between the user server's average use and maximum use. With more users per node, it becomes safer to set requests closer to the average use (see the first sketch after this list).

  3. Causing a significant remainder of unscheduled capacity.
    Requests should pack well onto nodes, leaving little unscheduled capacity. This fails when requesting, for example, 51% or 26% of an available resource: a node then only fits 1 or 3 users respectively, instead of the more appropriate 2 or 4, leaving 49% or 22% of its capacity unscheduled (see the second sketch after this list).
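To make point 2 concrete, a sketch with hypothetical numbers (a 100 GiB node and servers averaging 5 GiB of actual use; neither figure is from this issue):

```python
# Hypothetical: servers average 5 GiB of real use on a 100 GiB node. Pack the
# node full based on requests and compare expected use against capacity.
NODE_GIB = 100
AVERAGE_USE_GIB = 5.0

for request_gib in (4.0, 5.0, 6.0):
    fits = int(NODE_GIB / request_gib)     # servers the scheduler will place
    expected_use = fits * AVERAGE_USE_GIB  # expected actual consumption
    print(f"request {request_gib:.0f} GiB: {fits} servers, "
          f"expected use {expected_use:.0f}/{NODE_GIB} GiB")
# request 4 GiB: 25 servers, expected use 125/100 GiB  -> guaranteed trouble
# request 5 GiB: 20 servers, expected use 100/100 GiB  -> borderline
# request 6 GiB: 16 servers, expected use  80/100 GiB  -> headroom
```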
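And for point 3, the 51%/26% packing arithmetic spelled out:

```python
# Requests expressed as a fraction of a node's allocatable resource: how many
# servers fit on one node, and how much capacity is left unscheduled.
def packing(request_fraction: float) -> tuple[int, float]:
    fits = int(1 / request_fraction)          # servers that fit on one node
    return fits, 1 - fits * request_fraction  # fraction left unscheduled

for r in (0.50, 0.51, 0.25, 0.26):
    fits, idle = packing(r)
    print(f"request {r:.0%}: fits {fits} server(s), {idle:.0%} unscheduled")
# request 50%: fits 2 server(s), 0% unscheduled
# request 51%: fits 1 server(s), 49% unscheduled
# request 25%: fits 4 server(s), 0% unscheduled
# request 26%: fits 3 server(s), 22% unscheduled
```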
