Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using continuous prePuller #3313

Merged
merged 2 commits into from
Oct 31, 2023

Conversation

consideRatio
Copy link
Contributor

@consideRatio consideRatio commented Oct 24, 2023

Update 2023-10-30

Based on feedback from Yuvi in #3313 (comment), I've pivoted to only disable the continuous prePuller in this PR, and not also the hook prePuller. I'm opening a followup PR to this to constrain the use of hook prePuller in another PR to help us avoid issues with it.

Original PR

Fixes #2344 by inlining comments in basehub's default config that already disables prePuller logic by default, and by disabling both continous and hook prePuller where it is enabled for individual hubs.


prePuller is a somewhat complex machinery to reduce startup times in favorable conditions - conditions that also could change outside our chart configuration via the configurator for example. In unfavorable conditions, use of prePuller could instead increase startup times. This fragility and complexity has caused me some stress and cognitive load that I don't want to introduce to others, but instead help us all avoid.

This make me think the right call is to systematically disable both use of continious prePuller and hook prePuller.

Continue reading for some additional details

Disabling continuous pulling

The continuous prePuller could be beneficial if nodes are pre-started before a user pod arrives to them. It was initially designed with intent to be used with user-placeholder / node-placeholder pods, something we are not using in any of our clusters.

We have successfully made use of the continuous prePuller in the past when pre-starting many nodes for events. With the introduction of the practice to schedule multiple users per node, this is not as relevant. I think if we want to use it for that time, we can do a one off image pull just after pre-starting nodes instead of having it enabled by default - this is to be clarified by #2541.

Disabling hook pulling

The hook puller is active during helm ugprade, blocking the actual upgrade until pulling has been completed. I see some value in it compared to the current use of the continuous prePuller, but I think we are better off avoiding its complexity we introduce.

Read the inlined config notes about disabling it for some additional details.

After disabling, these are no longer relevant caveats

When I've heard "slow startup" be mentioned in support tickets etc, I've often ended up considering caveats below to ensure use of prePuller wasn't causing it. If we disable prePulling systematically with this PR, none of us need to consider caveats like below going onwards to be the cause of slow startups! We could focus on fixing issues caused by other things without even first ruling out its the prePuller causing it.

  • prePuller (continuous mainly) used for a hub scheduling user pods on nodes used by other hubs is likely to disturb other hubs, using a dedicated node pool has other caveats
  • prePuller doesn't pull all images we configure for use - not those via the configurator or profile_options in some profile lists
  • continuous prePuller pulling multiple images can even when pulling the right images delay startup times for non-pre-started nodes
  • hook prePuller requires use of ClusterRoleBinding resources, something that some k8s admins isn't granted if they exercise the right to replicate in a cluster they only control a namespace within

@github-actions

This comment was marked as resolved.

@consideRatio consideRatio marked this pull request as ready for review October 24, 2023 13:46
@consideRatio consideRatio requested a review from a team as a code owner October 24, 2023 13:46
@consideRatio consideRatio changed the title Stop using prePuller, both continuous and hook [Engineering agreement needed] Stop using prePuller, both continuous and hook Oct 24, 2023
@yuvipanda
Copy link
Member

Would you consider letting the hook prepuller be for what i consider the utoronto case? The specifics are:

  1. configurator is not used
  2. No profile list
  3. The images are fairly large

In particular, without this, if we change the image tag here on the repo, the next user who starts the server after a deploy will be responsible for pulling the image and get a very slow startup. With this enabled, that cost is rolled into the helm deploy.

This was very poor UX for end users at berkeley, given that it happens “randomly” from their perspective, which is what led to the creation of the hook prePuller.

@consideRatio consideRatio marked this pull request as draft October 24, 2023 14:16
@consideRatio consideRatio changed the title [Engineering agreement needed] Stop using prePuller, both continuous and hook [Engineering agreement needed] Stop using continuous prePuller and constrain use of hook prePuller Oct 29, 2023
@consideRatio consideRatio changed the title [Engineering agreement needed] Stop using continuous prePuller and constrain use of hook prePuller Stop using continuous prePuller and constrain use of hook prePuller Oct 29, 2023
@consideRatio consideRatio changed the title Stop using continuous prePuller and constrain use of hook prePuller Stop using continuous prePuller Oct 29, 2023
@consideRatio consideRatio marked this pull request as ready for review October 29, 2023 23:56
@consideRatio
Copy link
Contributor Author

consideRatio commented Oct 30, 2023

I've updated this PR to only disable the continuous prePuller for now, and I've opened a separate PR (#3341, currently in draft) to constrain the use of hook prePuller. This PR is now ready for review!

@consideRatio consideRatio changed the title Stop using continuous prePuller [Agreement needed] Stop using continuous prePuller Oct 30, 2023
@consideRatio
Copy link
Contributor Author

@GeorgianaElena @sgibson91 do you agree on this change?

Comment on lines +146 to +157
# prePuller is about pulling a one or more images identified via chart
# configuration, including singleuser.image, singleuser.profileList entries
# with a dedicated image, but not profileList entries with images' specified
# via profile_options.
prePuller:
# continuous prePuller leads to the creation of a DaemonSet that starts a
# pod on each node to pull images.
#
# It is disabled as its only relevant for nodes started before user pods
# gets scheduled on them, in other cases it could delay startup and isn't
# expected to reduce startup times.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@consideRatio, what if we add these comments as a descriptions in the schema file?

I figure, the schema is where we currently "document" basehub defaults and having these in there kind of makes sense? Maybe in the future we can use that to generate some sort of API-like docs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the future we can use that to generate some sort of API-like docs?

I agree on letting basehub generate config reference docs based on descriptions for basehub added config in values.schema.yaml like done by z2jh!

I favor the approach of inlining a comment next to the changed default value in this case though.

  1. All charts require a values.yaml file, so helm chart users likely look in them to learn about default values. If we separate remarks about default values into a schema file / generated docs, they may be less findable from the values.yaml file.
  2. We don't yet have generated docs from the basehub's values.schema.yaml
  3. If/when we generate docs from the schema, should the schema really redefine parts of the schema of dependency charts to provide a description that the default value is changed?

@consideRatio consideRatio changed the title [Agreement needed] Stop using continuous prePuller Stop using continuous prePuller Oct 31, 2023
@yuvipanda
Copy link
Member

@consideRatio thoughts on how this would interact with how we 'pre-warm' nodes for an event? In those cases, I believe we increase the size of the nodepool to ensure faster startup, and the continuous prepuller would be required there (subject to the same constraints outlined in #3313 (comment)).

@consideRatio
Copy link
Contributor Author

consideRatio commented Oct 31, 2023

@consideRatio thoughts on how this would interact with how we 'pre-warm' nodes for an event?

@yuvipanda I've opened #2541 about this, my idea is that we do a single one-off pre-pull to pre-started nodes if we pre-start nodes.

helm-charts/basehub/values.yaml Outdated Show resolved Hide resolved
@consideRatio
Copy link
Contributor Author

Thank you for reviewing @GeorgianaElena and @yuvipanda!!!

@consideRatio consideRatio merged commit 7f73364 into 2i2c-org:master Oct 31, 2023
31 checks passed
@github-actions
Copy link

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6705278176

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done 🎉
Development

Successfully merging this pull request may close these issues.

Documented policy about use of JupyterHub chart's prePuller
3 participants