dynamic host volumes: volume fingerprinting #24613

Merged
merged 13 commits into dynamic-host-volumes from dhv-volume-fingerprint on Dec 9, 2024

Conversation

@gulducat gulducat (Member) commented Dec 5, 2024

We did plugin fingerprinting in #24589, not to be confused with volume fingerprinting, which is done here. The two are very different, both conceptually and in implementation. Volume fingerprinting behaves more like device or driver plugin fingerprinting, even though there isn't really a long-running process to maintain, and it is in keeping with some of the CSI structure.

With this, the whole flow works end-to-end: the volume create monitor can run to completion because the volume gets set to ready, the Node API returns the volume, and jobs can use it like any other host volume. The client also re-fingerprints volumes properly on agent restart.

$ nomad volume create host.volume.hcl
==> Created host volume test with ID 1e1d4540-14f9-2515-8502-24d46291717d
  ✓ Host volume "1e1d4540" ready 💚

    2024-12-05T17:06:46-05:00
    ID        = 1e1d4540-14f9-2515-8502-24d46291717d
    Name      = test
    Namespace = default
    Plugin ID = example-host-volume
    Node ID   = ae118273-5416-3ed7-01dd-10fe8bf49e81
    Node Pool = default
    Capacity  = 47 MiB
    State     = ready  💚
    Host Path = /opt/nomad/data/alloc_mounts/1e1d4540-14f9-2515-8502-24d46291717d

(hearts manually added to draw the eye in the description here)
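
As a very rough sketch of the restart behavior mentioned above (every name here is hypothetical, only to illustrate reloading persisted volumes back into the node fingerprint; it is not the actual implementation):

package hostvolumemanager

// volumeState is an assumed shape for what the client persists per volume;
// the field names are illustrative, not taken from the implementation.
type volumeState struct {
    ID       string
    Name     string
    HostPath string
}

// restoreSketch illustrates re-fingerprinting on agent restart: previously
// created volumes are reloaded from client state and re-published into the
// node fingerprint, so jobs keep seeing them as ordinary host volumes.
func restoreSketch(persisted []volumeState, publish func(map[string]volumeState)) {
    vols := make(map[string]volumeState, len(persisted))
    for _, vs := range persisted {
        vols[vs.Name] = vs
    }
    publish(vols) // same node-update path that initial fingerprinting uses
}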

The volume ID is also attached to the ClientHostVolumeConfig, so:

$ nomad node status -self -json | jq '.HostVolumes'
{
  "cloud-img": {
    "ID": "",      <== "classic" static host volume
    "Path": "/opt/nomad/cloud-img",
    "ReadOnly": false
  },
  "test": {
    "ID": "1e1d4540-14f9-2515-8502-24d46291717d",   <== dynamic host volume
    "Path": "/opt/nomad/data/alloc_mounts/1e1d4540-14f9-2515-8502-24d46291717d",
    "ReadOnly": false
  }
}
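
For reference, the shape implied by that output is roughly the following (field set approximated from the JSON above, not copied from the source):

// Approximate shape of the fingerprinted host volume entry, as implied by the
// JSON above; the volume name is the key in the node's HostVolumes map.
type ClientHostVolumeConfig struct {
    ID       string // empty for "classic" static host volumes, set for dynamic ones
    Path     string // host path that gets mounted into allocations
    ReadOnly bool
}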

ref: #24479

Resolved review threads:
  nomad/structs/volumes.go (outdated)
  client/hostvolumemanager/host_volumes.go (outdated)
  nomad/host_volume_endpoint.go
  api/nodes.go (outdated)
  client/node_updater.go (outdated)
  client/hostvolumemanager/host_volumes_test.go
  client/hostvolumemanager/host_volumes.go (outdated)
@gulducat gulducat marked this pull request as ready for review December 6, 2024 23:20
@gulducat gulducat requested review from a team as code owners December 6, 2024 23:20

@tgross tgross (Member) left a comment

LGTM!

return ctx.Done()
}
func (hvm *HostVolumeManager) Run() {}
func (hvm *HostVolumeManager) Shutdown() {}

The various RPC handlers pass a request context into the calls to the plugins (which is just a timeout on the background context), but we don't have a way of signalling to plugin calls that the client is shutting down so they can be cancelled. There are two options:

  • Stick an agent shutdown context on the HostVolumeManager that's closed here in Shutdown, and use a joincontext as we do elsewhere in the client to handle this problem (see the sketch after this list).
  • Derive the request context in the RPC handler from an agent shutdown context. This is a much cleaner option... but I suspect the reason we don't have that elsewhere is that the agent may not have its own shutdown context (because the code base predates contexts).

I don't think this is blocking to fix in this PR (once the agent process exits, those plugin calls are certainly going to stop!), but let's follow up on that later.
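
For the first option, a minimal sketch of the idea, assuming a small join-context helper (the helper and the fields shown on HostVolumeManager are illustrative wiring, not the existing client code):

package hostvolumemanager

import "context"

// joinContext returns a context that is cancelled when either parent is done.
// Illustrative only; the client's existing joincontext helper may differ.
func joinContext(a, b context.Context) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(a)
    go func() {
        defer cancel()
        select {
        case <-b.Done():   // agent shutdown
        case <-ctx.Done(): // request finished, timed out, or cancel() called
        }
    }()
    return ctx, cancel
}

// Hypothetical wiring: Shutdown closes an agent-level context that every
// plugin call's request context gets joined with.
type HostVolumeManager struct {
    shutdownCtx    context.Context
    shutdownCancel context.CancelFunc
    // plugin-call dependencies elided
}

func (hvm *HostVolumeManager) Shutdown() { hvm.shutdownCancel() }

func (hvm *HostVolumeManager) createVolume(reqCtx context.Context) error {
    ctx, cancel := joinContext(reqCtx, hvm.shutdownCtx)
    defer cancel()
    // the plugin call receives ctx, so it is cancelled either by the RPC
    // timeout or by agent shutdown, whichever happens first
    _ = ctx
    return nil
}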

Comment on lines +74 to +76
// dynamic volumes, like CSI, have more robust `capabilities`,
// so we always set ReadOnly to false, and let the scheduler
// decide when to ignore this and check capabilities instead.

I think the taskrunner's volume_hook will need to handle this too, but I'll follow up on that later.
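
Very roughly, the check the hook (and the scheduler) would need might look like this; the types and field names below are guesses for illustration, not the actual structs:

// hostVolume mirrors the fingerprinted entry: ID is empty for static volumes
// and set for dynamic ones. volumeRequest stands in for what the jobspec asked
// for. Both are illustrative, not the real types.
type hostVolume struct {
    ID       string
    ReadOnly bool
}

type volumeRequest struct {
    ReadOnly bool
}

// mountReadOnly decides whether a mount should be read-only. For static
// volumes the fingerprinted ReadOnly flag still applies; for dynamic volumes
// that flag is always false, so only the request (validated against the
// volume's capabilities at scheduling time) matters.
func mountReadOnly(vol hostVolume, req volumeRequest) bool {
    if vol.ID == "" { // "classic" static host volume
        return vol.ReadOnly || req.ReadOnly
    }
    return req.ReadOnly
}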

@gulducat gulducat merged commit 1f8c378 into dynamic-host-volumes Dec 9, 2024
18 checks passed
@gulducat gulducat deleted the dhv-volume-fingerprint branch December 9, 2024 20:26
tgross pushed a commit that referenced this pull request Dec 9, 2024
tgross pushed a commit that referenced this pull request Dec 13, 2024
tgross pushed a commit that referenced this pull request Dec 19, 2024