LEAP has less than 10% free disk space #3230

Closed
yuvipanda opened this issue Oct 4, 2023 · 10 comments

@yuvipanda
Member

Just got an alert saying LEAP has less than 10% free disk space. Work with the community rep to figure out whether we should increase the size or whether they should ask some folks to clean up.

@jmunroe
Contributor

jmunroe commented Oct 6, 2023

I'll write that email. I also need to check in with LEAP re: their cloud spend (currently at 50% of their allocation but also this is 6 months into their funding period so they are right on track)

@jmunroe
Contributor

jmunroe commented Oct 6, 2023

I hadn't noticed previously that Yuvi had already started that communication: https://2i2c.freshdesk.com/a/tickets/995

@jmunroe
Contributor

jmunroe commented Oct 6, 2023

Merged my email and Yuvi's into one ticket: https://2i2c.freshdesk.com/a/tickets/995

(@yuvi -- apologies for butting into this. I hadn't realized you were already on top of it. I read 'contact community rep' in the GitHub description, noticed that it didn't have anyone assigned, and assumed that meant I should be engaging with LEAP in some way.)

I note (mostly for my own reference) that this issue of LEAP disk space is directly connected with the still-open ticket https://2i2c.freshdesk.com/a/tickets/387 that references:

@jmunroe
Contributor

jmunroe commented Oct 6, 2023

For home directory storage, are per-user quotas a feature we can enable?

Looking at https://z2jh.jupyter.org/en/stable/jupyterhub/customizing/user-storage.html I see options like

image

but perhaps the fact that we use Google Filestore prevents this quota feature from being an option. I agree that implementing per-user quotas on home directories would be ideal for LEAP (and possibly for many other communities). Since that is not in place, I assume it is due to a constraint that I do not understand.

Related to the space issue, Julius is requesting better tooling for measuring and reporting per-user disk usage and for notifying users about it. I can think of a few ways to approach this in a more traditional HPC setting, but I am less certain about the options with Kubernetes PVCs, StorageClasses, and PVs.

@yuvipanda
Member Author

yuvipanda commented Oct 6, 2023

Sorry didn't fully see this, @jmunroe.

Yes, using Google Filestore is the reason we can't use the default from z2jh. It uses dynamic disk provisioning, which gets one disk per user. Quoting the problems with that here:

We could go nuclear and get rid of Google Filestore completely, and switch everyone to one single disk per person with a set size. This would entail that if they filled up their disk, their server would just no longer start - and we'd have to find a way to at least allow it to start so they can clean it up. However, the bigger problem here is cost - Filestore costs $0.000219 per GB, while an equivalent amount of persistent disk costs $0.04 per GB! That's almost a 200x difference in cost, which is why we're on Filestore. The other problem there is that without Filestore, we'd be paying for capacity rather than use in an even worse way than we are now - if each user's limit is set to 10G for example, we will pay for 10G continuously regardless of how much they are using. And once allocated, it cannot be resized down - only up. We also don't have a clear way to give some people more storage and others less. Whenever we have done this in the past (we ran UC Berkeley like this for a while), the cost of storage almost always dwarfed the cost of compute by an order of magnitude.
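
As a quick check of the arithmetic above, here is a minimal sketch that takes the two quoted per-GB prices at face value (the quote does not state a billing period, so none is assumed). The 100-user, 10 GB-per-user hub is a hypothetical example, not LEAP's actual numbers.

```python
# Per-GB prices exactly as quoted above; the billing period is not stated there.
FILESTORE_PER_GB = 0.000219
PERSISTENT_DISK_PER_GB = 0.04


def cost_ratio(disk_price: float, filestore_price: float) -> float:
    """How many times more expensive a GB of persistent disk is than a GB of Filestore."""
    return disk_price / filestore_price


def provisioned_cost(users: int, gb_per_user: float, price_per_gb: float) -> float:
    """Cost when every user gets a fixed-size disk, regardless of how much they use."""
    return users * gb_per_user * price_per_gb


ratio = cost_ratio(PERSISTENT_DISK_PER_GB, FILESTORE_PER_GB)
print(f"persistent disk / Filestore price ratio: about {ratio:.0f}x")

# Hypothetical hub: 100 users, each allocated 10 GB whether or not they fill it.
print(provisioned_cost(users=100, gb_per_user=10, price_per_gb=PERSISTENT_DISK_PER_GB))
```

With per-user disks the bill scales with the number of users times the allocated size, not with what is actually stored, which is the "paying for capacity rather than use" problem described above.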

In an HPC system or similar, they'd probably run their own NFS server and use something like XFS with its project quotas for per-directory quotas. We'd need to run our own NFS server to do this, and I think currently the team doesn't have enough capacity with filesystem engineering for us to do that.

I did make a new dashboard with an actual table that can be used to look at users, much much better than the graph that currently exists.

https://grafana.leap.2i2c.cloud/d/bd232539-52d0-4435-8a62-fe637dc822be/home-directory-usage-dashboard?orgId=1&editPanel=2

image

While this helps, I do agree that more automation needs to be made available.

@jmunroe
Contributor

jmunroe commented Oct 6, 2023

Thanks @yuvipanda for the explanation. I agree that running our own NFS server is not something we can support in the near term.

So maybe not ideal, but it sounds like "someone" (Julius? me?) should write a tool that scrapes this table (or more likely grabs the equivalent data with an API call), then sends an automated email or Slack message to each user who is way over their quota, warning them about it. A cron job or Zapier / IFTTT-like tools could be used.
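
A rough sketch of what the core of such a tool might look like, assuming the usage data arrives in the JSON shape that a Prometheus instant query returns. The `directory` label name, the 50 GiB soft limit, and the user names are all hypothetical placeholders; actually delivering the message (email, Slack) is deliberately left out.

```python
GIB = 1024 ** 3


def parse_usage(prom_response: dict, label: str = "directory") -> dict:
    """Map each user's directory label to bytes used, from an instant-vector response."""
    usage = {}
    for sample in prom_response["data"]["result"]:
        name = sample["metric"].get(label)
        if name is not None:
            # Prometheus encodes each sample as [timestamp, "value-as-string"].
            usage[name] = float(sample["value"][1])
    return usage


def over_quota(usage: dict, quota_bytes: float) -> list:
    """Return (user, bytes) pairs above the quota, largest first."""
    offenders = [(user, used) for user, used in usage.items() if used > quota_bytes]
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)


def warning_message(user: str, used_bytes: float, quota_bytes: float) -> str:
    """Format the text that would be sent to one user."""
    return (
        f"Hi {user}, your home directory uses {used_bytes / GIB:.1f} GiB, "
        f"which is above the {quota_bytes / GIB:.1f} GiB soft limit. "
        "Please remove files you no longer need."
    )


# Hypothetical response, hand-written in the instant-vector layout.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"directory": "alice"}, "value": [1696600000, str(60 * GIB)]},
    {"metric": {"directory": "bob"}, "value": [1696600000, str(5 * GIB)]},
]}}

for user, used in over_quota(parse_usage(sample), 50 * GIB):
    print(warning_message(user, used, 50 * GIB))
```

Keeping the parsing, filtering, and formatting steps as separate pure functions means the same logic could run from a cron job or be wired into whatever notification channel a community chooses.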

Since it involves emailing users directly, I wouldn't want 2i2c to actually run that service, but perhaps it is a general enough issue across other communities that documenting how to set up some sort of automated "email my users daily if over quota" tool would be worthwhile.

@jbusecke -- is this the kind of thing that would be useful to you? Did you want me to attempt to mock up something that you could then use to make your life easier? This may even give you a way to have a weekly report emailed automatically to you flagging "unusual" activity. Again, I don't think 2i2c would be able to run that automation for you, but I am happy to work on it because I think other Hub Champions face similar issues.

@yuvipanda
Member Author

The other piece relevant from my response on freshdesk is probably:

Notification is also not straightforward - everyone uses GitHub auth, so am not sure how exactly an automated process would be able to reach out to them?

But otherwise I think your message here matches what I had suggested we try to do on freshdesk.

@yuvipanda
Member Author

@jmunroe what is the status of this now? Can we close this?

@jbusecke
Contributor

These notifications are helpful, but as I elaborated via email, I think ultimately not enough to ensure smooth operation from many users. But from my end this issue can be closed now.

@github-project-automation github-project-automation bot moved this from Todo 👍 to Done 🎉 in Sprint Board Jan 15, 2024
@github-project-automation github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Jan 15, 2024
3 participants