dynamic host volumes: volume fingerprinting #24613
Conversation
LGTM!
```go
	return ctx.Done()
}

func (hvm *HostVolumeManager) Run() {}
func (hvm *HostVolumeManager) Shutdown() {}
```
The various RPC handlers pass a request context into the calls to the plugins (which is just a timeout on the background context), but we don't have a way of signalling to plugin calls that the client is shutting down to cancel them. There are two options:
- Stick an agent shutdown context on the `HostVolumeManager` that's closed here in `Shutdown`, and use a joincontext as we do elsewhere in the client to handle this problem.
- Derive the request context in the RPC handler from an agent shutdown context. This is a much cleaner option... but I suspect the reason we don't have that elsewhere is that the agent doesn't have its own shutdown context (because the code base predates contexts).

I don't think this is blocking for this PR (once the agent process exits, those plugin calls are certainly going to stop!), but let's follow up on it later.
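To make the first option concrete, here's a minimal sketch that assumes nothing about Nomad's internals; `joinContexts`, `CreateVolume`, and the `HostVolumeManager` fields shown are hypothetical names for illustration only:

```go
// Sketch of option 1: the manager holds an agent shutdown context that
// Shutdown() cancels, and each plugin call derives from a context joined
// with the per-request context, so either a request timeout or client
// shutdown cancels the call.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type HostVolumeManager struct {
	shutdownCtx    context.Context
	shutdownCancel context.CancelFunc
}

func NewHostVolumeManager() *HostVolumeManager {
	ctx, cancel := context.WithCancel(context.Background())
	return &HostVolumeManager{shutdownCtx: ctx, shutdownCancel: cancel}
}

// Shutdown cancels every plugin call derived from the shutdown context.
func (hvm *HostVolumeManager) Shutdown() { hvm.shutdownCancel() }

// joinContexts returns a context that is cancelled when either parent is done.
func joinContexts(a, b context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(a)
	go func() {
		select {
		case <-b.Done():
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}

// CreateVolume stands in for an RPC handler calling out to a plugin.
func (hvm *HostVolumeManager) CreateVolume(reqCtx context.Context) error {
	ctx, cancel := joinContexts(reqCtx, hvm.shutdownCtx)
	defer cancel()
	select {
	case <-ctx.Done(): // request timed out or agent is shutting down
		return ctx.Err()
	case <-time.After(50 * time.Millisecond): // placeholder for plugin work
		return nil
	}
}

func main() {
	hvm := NewHostVolumeManager()
	reqCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	go hvm.Shutdown() // simulate a client shutdown mid-call
	err := hvm.CreateVolume(reqCtx)
	fmt.Println(errors.Is(err, context.Canceled)) // likely true: the call was cancelled
}
```

Either the per-request timeout or `Shutdown` cancels the joined context, so in-flight plugin calls stop promptly when the client shuts down.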
```go
// dynamic volumes, like CSI, have more robust `capabilities`,
// so we always set ReadOnly to false, and let the scheduler
// decide when to ignore this and check capabilities instead.
```
I think the taskrunner's `volume_hook` will need to handle this too, but I'll follow up on that later.
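As a hedged illustration of that comment (all types and helpers below are hypothetical stand-ins, not Nomad's actual structs), the idea is that consumers derive read-only behavior from the capability a claim requests rather than from the stored `ReadOnly` flag:

```go
// Sketch: dynamic host volumes carry capabilities, the static ReadOnly flag
// is always false, and the scheduler (and later the taskrunner volume_hook)
// decides read-only-ness from the requested access mode.
package main

import "fmt"

type AccessMode string

const (
	AccessModeSingleNodeReader AccessMode = "single-node-reader-only"
	AccessModeSingleNodeWriter AccessMode = "single-node-writer"
)

type Capability struct {
	AccessMode AccessMode
}

type HostVolume struct {
	Name         string
	ReadOnly     bool         // always false for dynamic volumes
	Capabilities []Capability // what the volume actually supports
}

// readOnlyForClaim ignores the static ReadOnly flag and instead checks
// whether the requested access mode is supported and reader-only.
func readOnlyForClaim(vol *HostVolume, requested AccessMode) (bool, error) {
	for _, c := range vol.Capabilities {
		if c.AccessMode == requested {
			return requested == AccessModeSingleNodeReader, nil
		}
	}
	return false, fmt.Errorf("volume %q does not support access mode %q", vol.Name, requested)
}

func main() {
	vol := &HostVolume{
		Name:     "example",
		ReadOnly: false,
		Capabilities: []Capability{
			{AccessMode: AccessModeSingleNodeReader},
			{AccessMode: AccessModeSingleNodeWriter},
		},
	}
	ro, err := readOnlyForClaim(vol, AccessModeSingleNodeReader)
	fmt.Println(ro, err) // true <nil>
}
```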
We did plugin fingerprinting in #24589, not to be confused with volume fingerprinting, which is done here. They are very different, both conceptually and in implementation. This volume fingerprinting behaves more like device or driver plugin fingerprinting, even though there's not really any long-running process to maintain. This is in keeping with some of the CSI structure.
With this, the whole flow works end-to-end. The `volume create` monitor can run to completion, because the volume gets set to `ready`. The Node API returns the volumes, and jobs can use them like any other host volume. It also re-fingerprints properly on agent restart. (Hearts manually added to draw the eye in the description here.)
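As a rough sketch of the restart behavior (the state store, `VolumeState`, and the update callback below are illustrative assumptions, not Nomad's actual API), re-fingerprinting just means rebuilding the in-memory view from the client's persisted state and republishing it to the node, with no long-running plugin process involved:

```go
// Sketch: on agent start, restore known volumes from local client state and
// report every one of them in the node fingerprint again.
package main

import "fmt"

type VolumeState struct {
	ID   string
	Name string
	Path string
}

type HostVolumeManager struct {
	state      map[string]*VolumeState      // stand-in for the client state DB
	updateNode func(volumes []*VolumeState) // stand-in for the node fingerprint update
}

// restoreVolumes rebuilds the in-memory view from persisted state and
// republishes every known volume to the node fingerprint.
func (hvm *HostVolumeManager) restoreVolumes() {
	vols := make([]*VolumeState, 0, len(hvm.state))
	for _, v := range hvm.state {
		vols = append(vols, v)
	}
	hvm.updateNode(vols)
}

func main() {
	hvm := &HostVolumeManager{
		state: map[string]*VolumeState{
			"6c27d9f1": {ID: "6c27d9f1", Name: "example", Path: "/var/srv/example"},
		},
		updateNode: func(vols []*VolumeState) {
			for _, v := range vols {
				fmt.Printf("fingerprinted host volume %s (%s) at %s\n", v.Name, v.ID, v.Path)
			}
		},
	}
	hvm.restoreVolumes()
}
```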
The volume ID is also attached to the `ClientHostVolumeConfig`, so:

ref: #24479
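For illustration only (the field set shown here is an assumption, not a copy of Nomad's struct), attaching the volume ID might look roughly like this:

```go
// Sketch: a ClientHostVolumeConfig-like record that carries the dynamically
// created volume's ID alongside the usual name/path/read-only fields.
package main

import "fmt"

type ClientHostVolumeConfig struct {
	Name     string
	ID       string // empty for statically configured volumes, set for dynamic ones
	Path     string
	ReadOnly bool
}

func main() {
	cfg := ClientHostVolumeConfig{
		Name: "example",
		ID:   "6c27d9f1", // the dynamically created volume's ID
		Path: "/var/srv/example",
	}
	fmt.Printf("%+v\n", cfg)
}
```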