Cull pods that run for longer than 7 days #3042

Merged
merged 2 commits into from
Oct 19, 2023

Conversation

yuvipanda
Member

Ref #3015

@yuvipanda yuvipanda requested a review from a team as a code owner August 28, 2023 20:12
@github-actions

github-actions bot commented Aug 28, 2023

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| aws | openscapes | No | | Yes | Core infrastructure has been modified |
| aws | ubc-eoas | No | | Yes | Core infrastructure has been modified |
| gcp | qcl | No | | Yes | Core infrastructure has been modified |
| gcp | callysto | No | | Yes | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c | No | | Yes | Core infrastructure has been modified |
| gcp | linked-earth | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c-uk | No | | Yes | Core infrastructure has been modified |
| aws | victor | No | | Yes | Core infrastructure has been modified |
| aws | nasa-veda | No | | Yes | Core infrastructure has been modified |
| aws | nasa-cryo | No | | Yes | Core infrastructure has been modified |
| gcp | cloudbank | No | | Yes | Core infrastructure has been modified |
| gcp | awi-ciroh | No | | Yes | Core infrastructure has been modified |
| gcp | meom-ige | No | | Yes | Core infrastructure has been modified |
| gcp | catalystproject-latam | No | | Yes | Core infrastructure has been modified |
| gcp | hhmi | No | | Yes | Core infrastructure has been modified |
| aws | catalystproject-africa | No | | Yes | Core infrastructure has been modified |
| aws | 2i2c-aws-us | No | | Yes | Core infrastructure has been modified |
| gcp | m2lines | No | | Yes | Core infrastructure has been modified |
| aws | carbonplan | No | | Yes | Core infrastructure has been modified |
| gcp | leap | No | | Yes | Core infrastructure has been modified |
| aws | gridsst | No | | Yes | Core infrastructure has been modified |
| gcp | pangeo-hubs | No | | Yes | Core infrastructure has been modified |
| aws | nasa-ghg | No | | Yes | Core infrastructure has been modified |
| kubeconfig | utoronto | No | | Yes | Core infrastructure has been modified |
| aws | smithsonian | No | | Yes | Core infrastructure has been modified |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | openscapes | prod | Core infrastructure has been modified |
| aws | ubc-eoas | prod | Core infrastructure has been modified |
| gcp | qcl | prod | Core infrastructure has been modified |
| gcp | callysto | prod | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | prod | Core infrastructure has been modified |
| gcp | 2i2c | imagebuilding-demo | Core infrastructure has been modified |
| gcp | 2i2c | demo | Core infrastructure has been modified |
| gcp | 2i2c | ohw | Core infrastructure has been modified |
| gcp | 2i2c | aup | Core infrastructure has been modified |
| gcp | 2i2c | temple | Core infrastructure has been modified |
| gcp | 2i2c | ucmerced | Core infrastructure has been modified |
| gcp | 2i2c | climatematch | Core infrastructure has been modified |
| gcp | 2i2c | neurohackademy | Core infrastructure has been modified |
| gcp | 2i2c | mtu | Core infrastructure has been modified |
| gcp | 2i2c | jackeddy | Core infrastructure has been modified |
| gcp | linked-earth | prod | Core infrastructure has been modified |
| gcp | 2i2c-uk | lis | Core infrastructure has been modified |
| aws | victor | prod | Core infrastructure has been modified |
| aws | nasa-veda | prod | Core infrastructure has been modified |
| aws | nasa-cryo | prod | Core infrastructure has been modified |
| gcp | cloudbank | bcc | Core infrastructure has been modified |
| gcp | cloudbank | ccsf | Core infrastructure has been modified |
| gcp | cloudbank | csm | Core infrastructure has been modified |
| gcp | cloudbank | dvc | Core infrastructure has been modified |
| gcp | cloudbank | elcamino | Core infrastructure has been modified |
| gcp | cloudbank | evc | Core infrastructure has been modified |
| gcp | cloudbank | glendale | Core infrastructure has been modified |
| gcp | cloudbank | howard | Core infrastructure has been modified |
| gcp | cloudbank | miracosta | Core infrastructure has been modified |
| gcp | cloudbank | skyline | Core infrastructure has been modified |
| gcp | cloudbank | demo | Core infrastructure has been modified |
| gcp | cloudbank | fresno | Core infrastructure has been modified |
| gcp | cloudbank | humboldt | Core infrastructure has been modified |
| gcp | cloudbank | laney | Core infrastructure has been modified |
| gcp | cloudbank | sbcc | Core infrastructure has been modified |
| gcp | cloudbank | sbcc-dev | Core infrastructure has been modified |
| gcp | cloudbank | lacc | Core infrastructure has been modified |
| gcp | cloudbank | lamission | Core infrastructure has been modified |
| gcp | cloudbank | mills | Core infrastructure has been modified |
| gcp | cloudbank | mission | Core infrastructure has been modified |
| gcp | cloudbank | norco | Core infrastructure has been modified |
| gcp | cloudbank | palomar | Core infrastructure has been modified |
| gcp | cloudbank | pasadena | Core infrastructure has been modified |
| gcp | cloudbank | sjcc | Core infrastructure has been modified |
| gcp | cloudbank | sacramento | Core infrastructure has been modified |
| gcp | cloudbank | srjc | Core infrastructure has been modified |
| gcp | cloudbank | saddleback | Core infrastructure has been modified |
| gcp | cloudbank | santiago | Core infrastructure has been modified |
| gcp | cloudbank | sjsu | Core infrastructure has been modified |
| gcp | cloudbank | tuskegee | Core infrastructure has been modified |
| gcp | cloudbank | wlac | Core infrastructure has been modified |
| gcp | cloudbank | csulb | Core infrastructure has been modified |
| gcp | cloudbank | csum | Core infrastructure has been modified |
| gcp | awi-ciroh | prod | Core infrastructure has been modified |
| gcp | meom-ige | prod | Core infrastructure has been modified |
| gcp | catalystproject-latam | unitefa-conicet | Core infrastructure has been modified |
| gcp | hhmi | prod | Core infrastructure has been modified |
| aws | catalystproject-africa | nm-aist | Core infrastructure has been modified |
| aws | 2i2c-aws-us | researchdelight | Core infrastructure has been modified |
| aws | 2i2c-aws-us | ncar-cisl | Core infrastructure has been modified |
| aws | 2i2c-aws-us | go-bgc | Core infrastructure has been modified |
| aws | 2i2c-aws-us | itcoocean | Core infrastructure has been modified |
| aws | 2i2c-aws-us | cosmicds | Core infrastructure has been modified |
| gcp | m2lines | prod | Core infrastructure has been modified |
| aws | carbonplan | prod | Core infrastructure has been modified |
| gcp | leap | prod | Core infrastructure has been modified |
| aws | gridsst | prod | Core infrastructure has been modified |
| gcp | pangeo-hubs | prod | Core infrastructure has been modified |
| gcp | pangeo-hubs | coessing | Core infrastructure has been modified |
| aws | nasa-ghg | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | r-prod | Core infrastructure has been modified |
| aws | smithsonian | prod | Core infrastructure has been modified |

@consideRatio
Contributor

I agree with this change, but I don't think we should introduce it silently. At the very least, the QCL community should be informed, as they have in the past run long calculations on a single node.

I think it could make sense to notify all community champions about this. I also figure this config is us influencing the user environment, so it should be documented somewhere for them.

I opened #3017 about communication in situations like this, among other things, as I've felt that I lack agency over how to communicate a change that affects users once we've decided it's the right thing to do.

For this change, I suggest:

  • we communicate that we are planning this change for a certain date, referencing docs we've written that explain the default of having a max age.
  • invite users to opt out or request a different max age
  • do the change
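
The opt-out in the second bullet could work as a per-hub override of a chart-wide default. Below is a minimal sketch, assuming the limit is expressed as a `cull.maxAge` value in seconds (the option name used by the Zero to JupyterHub chart, where 0 disables the limit). The diff of this PR is not shown in this conversation, so the key path, the `deep_merge` and `effective_max_age` helpers, and the example overrides are illustrative assumptions only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical chart-wide default: 7 days, expressed in seconds.
BASE_VALUES = {"jupyterhub": {"cull": {"maxAge": 7 * SECONDS_PER_DAY}}}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively.

    This roughly mirrors how Helm layers one values file over another.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_max_age(hub_values: dict) -> int:
    """Max age in seconds for one hub; 0 means the hub has opted out."""
    return deep_merge(BASE_VALUES, hub_values)["jupyterhub"]["cull"]["maxAge"]
```

With no override, a hub inherits the 7-day default. A hub that sets `maxAge` to 0 opts out, and a hub that sets another value gets that limit instead.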

@yuvipanda
Member Author

yuvipanda commented Aug 29, 2023

Thanks @consideRatio, I think that makes sense.

I don't think I have the capacity to work on communicating this at this point, and I'm not sure who does. So I'm going to unassign myself from this, as I don't think I can take it to completion.

@jmunroe
Contributor

jmunroe commented Oct 5, 2023

We just received feedback from QCL (https://2i2c.freshdesk.com/a/tickets/972) on this:

7 days would be excellent for us. Will this be the same limit as the timeout for a disconnected session (that still has processes running)?

I think the answer to the question is "yes", but I welcome corrections if there is some subtlety here.

In any case I think this response unblocks this PR!
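
On the QCL question, here is a rough model of how an idle timeout and a maximum age can interact, assuming the behaviour documented for the `--timeout` and `--max-age` options of jupyterhub-idle-culler: idle servers are culled after the timeout, and servers older than the max age are culled even if they are active. The 1-hour idle timeout and the `should_cull` helper are hypothetical and not taken from this PR. Whether a disconnected session with running processes counts as active depends on how activity is reported, which this sketch does not model.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)        # the limit discussed in this PR
IDLE_TIMEOUT = timedelta(hours=1)  # hypothetical value, not taken from this PR


def should_cull(started: datetime, last_activity: datetime, now: datetime) -> bool:
    """Cull when the server has been idle too long or is older than the max age."""
    idle_too_long = now - last_activity > IDLE_TIMEOUT
    # A max age of zero is treated as "no age limit".
    too_old = MAX_AGE > timedelta(0) and now - started > MAX_AGE
    return idle_too_long or too_old
```

Under this model, a server that stays active is still stopped once it is more than 7 days old, while an inactive one is stopped much sooner by the idle timeout.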

Contributor

@consideRatio consideRatio left a comment

I think a lot of the long-running jobs we observed were caused by the KubeSpawner bug in 6.0.0, which has since been fixed and cleaned up. With that in mind, I no longer expect this to be as breaking as I thought when I saw several long-lived pods.

I opened 2i2c-org/docs#193 with relevant complementary docs.

helm-charts/basehub/values.yaml (outdated review thread, resolved)
@consideRatio
Contributor

@jmunroe I pinged you for review by mistake. I meant to assign you for attribution of the work done!

@consideRatio consideRatio merged commit e2cbe30 into 2i2c-org:master Oct 19, 2023
31 checks passed
@github-actions

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6574132482

4 participants