-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop using continuous prePuller #3313
Stop using continuous prePuller #3313
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
Would you consider letting the hook prepuller be for what i consider the utoronto case? The specifics are:
In particular, without this, if we change the image tag here on the repo, the next user who starts the server after a deploy will be responsible for pulling the image and get a very slow startup. With this enabled, that cost is rolled into the helm deploy. This was very poor UX for end users at berkeley, given that it happens “randomly” from their perspective, which is what led to the creation of the hook prePuller. |
347be22
to
d403b4c
Compare
I've updated this PR to only disable the continuous prePuller for now, and I've opened a separate PR (#3341, currently in draft) to constrain the use of hook prePuller. This PR is now ready for review! |
@GeorgianaElena @sgibson91 do you agree on this change? |
# prePuller is about pulling a one or more images identified via chart | ||
# configuration, including singleuser.image, singleuser.profileList entries | ||
# with a dedicated image, but not profileList entries with images' specified | ||
# via profile_options. | ||
prePuller: | ||
# continuous prePuller leads to the creation of a DaemonSet that starts a | ||
# pod on each node to pull images. | ||
# | ||
# It is disabled as its only relevant for nodes started before user pods | ||
# gets scheduled on them, in other cases it could delay startup and isn't | ||
# expected to reduce startup times. | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@consideRatio, what if we add these comments as a descriptions in the schema file?
I figure, the schema is where we currently "document" basehub defaults and having these in there kind of makes sense? Maybe in the future we can use that to generate some sort of API-like docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in the future we can use that to generate some sort of API-like docs?
I agree on letting basehub generate config reference docs based on descriptions for basehub added config in values.schema.yaml like done by z2jh!
I favor the approach of inlining a comment next to the changed default value in this case though.
- All charts require a values.yaml file, so helm chart users likely look in them to learn about default values. If we separate remarks about default values into a schema file / generated docs, they may be less findable from the values.yaml file.
- We don't yet have generated docs from the basehub's values.schema.yaml
- If/when we generate docs from the schema, should the schema really redefine parts of the schema of dependency charts to provide a description that the default value is changed?
@consideRatio thoughts on how this would interact with how we 'pre-warm' nodes for an event? In those cases, I believe we increase the size of the nodepool to ensure faster startup, and the continuous prepuller would be required there (subject to the same constraints outlined in #3313 (comment)). |
@yuvipanda I've opened #2541 about this, my idea is that we do a single one-off pre-pull to pre-started nodes if we pre-start nodes. |
Co-authored-by: Yuvi Panda <[email protected]>
Thank you for reviewing @GeorgianaElena and @yuvipanda!!! |
🎉🎉🎉🎉 Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/6705278176 |
Update 2023-10-30
Based on feedback from Yuvi in #3313 (comment), I've pivoted to only disable the continuous prePuller in this PR, and not also the hook prePuller. I'm opening a followup PR to this to constrain the use of hook prePuller in another PR to help us avoid issues with it.
Original PR
Fixes #2344 by inlining comments in basehub's default config that already disables prePuller logic by default, and by disabling both continous and hook prePuller where it is enabled for individual hubs.
prePuller is a somewhat complex machinery to reduce startup times in favorable conditions - conditions that also could change outside our chart configuration via the configurator for example. In unfavorable conditions, use of prePuller could instead increase startup times. This fragility and complexity has caused me some stress and cognitive load that I don't want to introduce to others, but instead help us all avoid.
This make me think the right call is to systematically disable both use of continious prePuller and hook prePuller.
Continue reading for some additional details
Disabling continuous pulling
The continuous prePuller could be beneficial if nodes are pre-started before a user pod arrives to them. It was initially designed with intent to be used with
user-placeholder
/node-placeholder
pods, something we are not using in any of our clusters.We have successfully made use of the continuous prePuller in the past when pre-starting many nodes for events. With the introduction of the practice to schedule multiple users per node, this is not as relevant. I think if we want to use it for that time, we can do a one off image pull just after pre-starting nodes instead of having it enabled by default - this is to be clarified by #2541.
Disabling hook pulling
The hook puller is active during
helm ugprade
, blocking the actual upgrade until pulling has been completed. I see some value in it compared to the current use of the continuous prePuller, but I think we are better off avoiding its complexity we introduce.Read the inlined config notes about disabling it for some additional details.
After disabling, these are no longer relevant caveats
When I've heard "slow startup" be mentioned in support tickets etc, I've often ended up considering caveats below to ensure use of prePuller wasn't causing it. If we disable prePulling systematically with this PR, none of us need to consider caveats like below going onwards to be the cause of slow startups! We could focus on fixing issues caused by other things without even first ruling out its the prePuller causing it.