[Platform Initiative] Hub Scale Cost Monitoring for AWS #4384

Closed
Gman0909 opened this issue Jul 8, 2024 · 19 comments

@Gman0909
Contributor

Gman0909 commented Jul 8, 2024

Productboard link: https://2i2c.productboard.com/roadmap/7947557-2i2c-roadmap/features/26823459

Description

Institutional leads, as well as department leads in large organizations, need to be able to justify their budgets and ensure they are being spent with value in mind. Business intelligence depends on data, and we want to make sure we build towards a data reporting infrastructure that can make a hub or constellation of hubs' usage and cost more transparent, enabling better decision making come budget time, as well as offering a sense of security and transparency over a service that is often perceived as being a high risk for cost overruns.

To that end, we would like to give community and institutional leaders the ability to monitor the cost and usage of a hub, or groups of hubs, provided by 2i2c.

The solution should provide a dashboard that updates automatically to show up-to-date aggregated cost and usage reports for each hub in a constellation, or for the single hub an administrator has admin rights over. Data should be exportable as reports.

Additionally, we should investigate adding an option to share the dashboard with individuals outside of those with administrative privileges.

Typical use cases:

  • Hub admin wants to keep an eye on their hub's costs to ensure they fall within budgetary lines
  • Institutional lead wants to know which departments are costing the most, to ensure they are staying within budget
  • Admins and institutional leads want to make sure their hubs are getting the usage they expect, to ensure they're getting their money's worth
  • Admins want to see how many active users are on a hub over a defined period of time
  • An institutional lead wants to export a CSV report they can use to plot a chart of their hubs' costs across departments for an upcoming budget review.

Scope

We already have a document listing the things cloud providers charge you for. The things we care about, in priority order, are:

  1. Home directory storage (NFS)
  2. Object storage (scratch and persistent)
  3. Compute (nodes)

Each of these costs should be attributable to either:

  1. A common pool that serves all of the hubs (primarily, the core node pool and staging hubs)
  2. A particular hub

Attributing costs to individual users, or to specific subgroups inside a hub, is out of scope.
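
As a rough illustration of the attribution rule above, the following Python sketch rolls cost line items up by owner (a particular hub, or the common pool) and by category. The `CostRecord` type, the `shared` label, and the hub names are hypothetical and are not taken from 2i2c's implementation; real billing data would first need to be mapped onto this shape.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Label for costs that serve all hubs (e.g. the core node pool). Hypothetical name.
SHARED_POOL = "shared"


@dataclass
class CostRecord:
    """One billed line item. Field names are assumptions for this sketch."""
    category: str               # "home storage", "object storage" or "compute"
    amount_usd: float
    hub: Optional[str] = None   # None: not attributable to a single hub


def attribute_costs(records):
    """Sum costs per owner (hub name or shared pool) and per category."""
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        owner = record.hub if record.hub is not None else SHARED_POOL
        totals[owner][record.category] += record.amount_usd
    return {owner: dict(by_category) for owner, by_category in totals.items()}


# Made-up example data.
records = [
    CostRecord("home storage", 40.0, hub="hub-a"),
    CostRecord("compute", 100.0, hub="hub-a"),
    CostRecord("compute", 25.0, hub="hub-b"),
    CostRecord("compute", 60.0),  # core node pool, shared by all hubs
]
report = attribute_costs(records)
```

Keeping the shared pool as its own owner, rather than spreading it across hubs, matches the two attribution targets listed above and leaves any further splitting as a separate decision.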

Definition of Done

Admins of any 2i2c hub can access dashboards and reports where they can monitor up-to-date cost information for their hubs, and export reports with that same information.

@Gman0909 Gman0909 added the Product User-facing features, behavior, UX, etc label Jul 8, 2024
@yuvipanda yuvipanda changed the title Hub Scale Cost Monitoring [Initiative] Hub Scale Cost Monitoring Jul 18, 2024
@yuvipanda
Member

I've considered OpenCost, and discarded it as not being able to satisfy our needs.

@jnywong
Member

jnywong commented Jul 26, 2024

Just reading the Cloudbank ACM paper and they mentioned a closed source solution called Nutanix BEAM. Are we specifically leaning into open-source solutions here?

@Gman0909
Contributor Author

Gman0909 commented Jul 26, 2024

Has anyone checked out OpenCost? It's based on Prometheus. It looks like it might have potential for dedicated clusters at least.

@jnywong
Member

jnywong commented Jul 26, 2024

@Gman0909 see yuvi's comment above 😆

@Gman0909
Contributor Author

D'oh.

@aprilmj

aprilmj commented Sep 9, 2024

@Gman0909 and @haroldcampbell will have an offline conversation about how we refine the tasks associated with this (particularly how we get unblocked). There are a lot of unknowns even after spikes (see #4453).

@aprilmj

aprilmj commented Sep 23, 2024

Current update: we have a replicable solution for AWS; what's next? Need a definition of done & would like a showcase - James, Jenny & Jim would like to be able to use this, get feedback from Openscapes and be able to show/tell others how to use.

Note from @yuvipanda: the Openscapes folks are giving feedback via an informal showcase every week, and the metrics we chose were used to determine what features to build.

from @Gman0909: https://grafana.openscapes.2i2c.cloud/d/edw06h7udjwg0b/cloud-cost-attribution?orgId=1&from=now%2FfQ&to=now%2FfQ
Dash accessible via GitHub login

@aprilmj

aprilmj commented Sep 23, 2024

Action to take next:

  1. @colliand and @Gman0909 and @yuvipanda - how do we formalize the things we've done with Openscapes/on this initiative so far into a repeatable process we can follow for future development of this feature and others?
  2. Post-mortem on this feature (@haroldcampbell to own making that happen)

@Gman0909 Gman0909 changed the title [Initiative] Hub Scale Cost Monitoring [Platform Initiative] Hub Scale Cost Monitoring Oct 22, 2024
@Gman0909 Gman0909 changed the title [Platform Initiative] Hub Scale Cost Monitoring [Platform Initiative] Hub Scale Cost Monitoring for AWS Nov 13, 2024
@GeorgianaElena
Member

I've now updated the top comment to contain all the remaining open issues related to this initiative and closed the parent epic issue.

There is still #4872 (comment) that isn't captured yet. If this was a feature request that @Gman0909 or @jnywong has more context about, would you mind opening an issue about it, or sharing the context here so I can open the issue later? Thank you!

@jnywong
Member

jnywong commented Nov 15, 2024

Thanks @GeorgianaElena !

I don't have context for that comment – would be useful to record this as an insight on ProductBoard @consideRatio if you haven't already 👍

@consideRatio
Member

consideRatio commented Nov 18, 2024

The comment at https://github.com/2i2c-org/meta/issues/1511#issuecomment-2393887518 best captures the context for what I meant in #4872 (comment). It was a feature that Tasha Snow observed to be relevant and missing.

@Gman0909 summarized this:

The grouping of component costs under functional headings like (networking, object storage etc) is useful for reporting, and really nice for talking to the PI of a project. But Tasha would really like the more granular base metrics.

  • Breakdown by machine types would be great.

I figure that, technically, implementing this means adding a new panel to the dashboard that uses data filtered on "compute" costs for separate hubs, grouped by a new AWS tag such as alpha.eksctl.io/nodegroup-name, allowing us to present the compute cost broken down by node group name.
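
A minimal sketch of that grouping in Python, assuming the cost data arrives in the shape of an AWS Cost Explorer `GetCostAndUsage` response grouped by the tag, where each group key has the form `<tag key>$<tag value>`. Whether the dashboard would actually read from Cost Explorer this way is an assumption, and the sample response and node group names below are made up.

```python
from collections import defaultdict

NODEGROUP_TAG = "alpha.eksctl.io/nodegroup-name"


def compute_cost_by_nodegroup(response):
    """Sum UnblendedCost per node group across all time periods."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Tag group keys look like "<tag key>$<tag value>";
            # an empty value means the resource carries no such tag.
            _, _, value = group["Keys"][0].partition("$")
            name = value or "(untagged)"
            totals[name] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def _group(nodegroup, amount):
    """Build one response group entry (helper for the made-up sample)."""
    return {
        "Keys": [f"{NODEGROUP_TAG}${nodegroup}"],
        "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}},
    }


# Made-up sample covering two days of "compute" costs.
sample_response = {
    "ResultsByTime": [
        {"Groups": [_group("nb-r5-xlarge", "10.5"), _group("core", "3.0")]},
        {"Groups": [_group("nb-r5-xlarge", "4.5"), _group("", "1.25")]},
    ]
}
by_nodegroup = compute_cost_by_nodegroup(sample_response)
```

The untagged bucket is kept visible so that compute costs missing the tag are not silently dropped from the breakdown.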

It isn't obvious it should be implemented just because it was found relevant, but I wanted to help ensure we don't forget about this observation so a decision can be made to go or not go for it.

@Gman0909
Contributor Author

The task list has been converted to sub issues for better tracking.

@GeorgianaElena and @sgibson91, the original intent was to try and get this wrapped up by the end of the sprint, are you confident we're on track to achieve that?

@sgibson91
Member

@Gman0909 I would say #5077 would mean no. That's a mammoth task. It also was not committed for this sprint, which could've been another planning error.

@GeorgianaElena
Member

> I would say #5077 would mean no. That's a mammoth task. It also was not committed for this sprint, which could've been another planning error.

This wasn't a planning error, but rather @yuvipanda and I deciding to hold off on the EFS split until we saw how the EBS migration was going. Given that it is going ok, I've closed #5077, as the EBS migrations that will happen as part of #5010 will supersede the EFS split work.

> @GeorgianaElena and @sgibson91, the original intent was to try and get this wrapped up by the end of the sprint, are you confident we're on track to achieve that?

Yes, we're on track

@sgibson91
Member

@GeorgianaElena At the minute, #5010 does not track rolling out EBS to all AWS hubs - only nmfs-openscapes, cryocloud and veda

@sgibson91
Member

I guess I'm just raising that there is now a dependency between this initiative and #5010 and I don't know how that affects the Definition of Done here.

From my perspective, the EBS work will not be done by the end of this sprint. I don't even intend to start work on the next community until next sprint. So that does mean that there won't be complete cost info in grafana for all AWS hubs by the end of this sprint. Maybe we're fine with that and we just adjust the DoD accordingly?

@GeorgianaElena
Member

> So that does mean that there won't be complete cost info in grafana for all AWS hubs by the end of this sprint. Maybe we're fine with that and we just adjust the DoD accordingly?

This initiative was considered complete even without the nodegroup and homedirs split.
To still be able to offer per-hub costs, we re-prioritized the split and extended the timeline by one additional sprint.
Because we could not commit to doing everything in one sprint, we compromised and decided to leave that to be covered via the jupyterhub-home-nfs effort.

So it is my understanding that we are fine with the current plan. However, because I wasn't involved throughout the entire timeline of the initiative I will leave @yuvipanda to chime in if additional historical context is needed.

@GeorgianaElena
Member

Since all the sub-issues of this initiative have been wrapped up, I will close this issue 🎉 Thank you all for all the work on this one! ❤

@Gman0909
Contributor Author

Thanks so much to everyone who contributed to this massive effort!
