dynamic host volumes: volume fingerprinting #24613

Merged
merged 13 commits into dynamic-host-volumes from dhv-volume-fingerprint on Dec 9, 2024

Conversation

@gulducat gulducat (Member) commented Dec 5, 2024

We did plugin fingerprinting in #24589, not to be confused with volume fingerprinting, which is done here. The two are very different, both conceptually and in implementation. Volume fingerprinting behaves more like device or driver plugin fingerprinting, even though there isn't really a long-running process to maintain, and it is in keeping with some of the CSI structure.

With this, the whole flow works end-to-end: the volume create monitor can run to completion because the volume gets set to ready, the Node API returns the volume, and jobs can use it like any other host volume. The client also re-fingerprints volumes properly on agent restart.

$ nomad volume create host.volume.hcl
==> Created host volume test with ID 1e1d4540-14f9-2515-8502-24d46291717d
  ✓ Host volume "1e1d4540" ready 💚

    2024-12-05T17:06:46-05:00
    ID        = 1e1d4540-14f9-2515-8502-24d46291717d
    Name      = test
    Namespace = default
    Plugin ID = example-host-volume
    Node ID   = ae118273-5416-3ed7-01dd-10fe8bf49e81
    Node Pool = default
    Capacity  = 47 MiB
    State     = ready  💚
    Host Path = /opt/nomad/data/alloc_mounts/1e1d4540-14f9-2515-8502-24d46291717d

(hearts manually added to draw the eye in the description here)
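
As a very rough sketch of the restart behavior mentioned above (every name here is hypothetical, only to illustrate reloading persisted volumes back into the node fingerprint; it is not the actual implementation):

package hostvolumemanager

// volumeState is an assumed shape for what the client persists per volume;
// the field names are illustrative, not taken from the implementation.
type volumeState struct {
    ID       string
    Name     string
    HostPath string
}

// restoreSketch illustrates re-fingerprinting on agent restart: previously
// created volumes are reloaded from client state and re-published into the
// node fingerprint, so jobs keep seeing them as ordinary host volumes.
func restoreSketch(persisted []volumeState, publish func(map[string]volumeState)) {
    vols := make(map[string]volumeState, len(persisted))
    for _, vs := range persisted {
        vols[vs.Name] = vs
    }
    publish(vols) // same node-update path that initial fingerprinting uses
}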

The volume ID is also attached to the ClientHostVolumeConfig, so:

$ nomad node status -self -json | jq '.HostVolumes'
{
  "cloud-img": {
    "ID": "",      <== "classic" static host volume
    "Path": "/opt/nomad/cloud-img",
    "ReadOnly": false
  },
  "test": {
    "ID": "1e1d4540-14f9-2515-8502-24d46291717d",   <== dynamic host volume
    "Path": "/opt/nomad/data/alloc_mounts/1e1d4540-14f9-2515-8502-24d46291717d",
    "ReadOnly": false
  }
}
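
For reference, the shape implied by that output is roughly the following (field set approximated from the JSON above, not copied from the source):

// Approximate shape of the fingerprinted host volume entry, as implied by the
// JSON above; the volume name is the key in the node's HostVolumes map.
type ClientHostVolumeConfig struct {
    ID       string // empty for "classic" static host volumes, set for dynamic ones
    Path     string // host path that gets mounted into allocations
    ReadOnly bool
}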

ref: #24479

Resolved review threads:
  nomad/structs/volumes.go (outdated)
  client/hostvolumemanager/host_volumes.go (outdated)
  nomad/host_volume_endpoint.go
  api/nodes.go (outdated)
  client/node_updater.go (outdated)
  client/hostvolumemanager/host_volumes_test.go
  client/hostvolumemanager/host_volumes.go (outdated)
@gulducat gulducat marked this pull request as ready for review December 6, 2024 23:20
@gulducat gulducat requested review from a team as code owners December 6, 2024 23:20

@tgross tgross (Member) left a comment

LGTM!

return ctx.Done()
}
func (hvm *HostVolumeManager) Run() {}
func (hvm *HostVolumeManager) Shutdown() {}

The various RPC handlers pass a request context into the calls to the plugins (which is just a timeout on the background context), but we don't have a way of signalling to plugin calls that the client is shutting down so they can be cancelled. There are two options:

  • Stick an agent shutdown context on the HostVolumeManager that's closed here in Shutdown, and use a joincontext as we do elsewhere in the client to handle this problem (see the sketch after this list).
  • Derive the request context in the RPC handler from an agent shutdown context. This is a much cleaner option... but I suspect the reason we don't have that elsewhere is that the agent may not have its own shutdown context (because the code base predates contexts).

I don't think this is blocking to fix in this PR (once the agent process exits, those plugin calls are certainly going to stop!), but let's follow up on that later.
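
For the first option, a minimal sketch of the idea, assuming a small join-context helper (the helper and the fields shown on HostVolumeManager are illustrative wiring, not the existing client code):

package hostvolumemanager

import "context"

// joinContext returns a context that is cancelled when either parent is done.
// Illustrative only; the client's existing joincontext helper may differ.
func joinContext(a, b context.Context) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(a)
    go func() {
        defer cancel()
        select {
        case <-b.Done():   // agent shutdown
        case <-ctx.Done(): // request finished, timed out, or cancel() called
        }
    }()
    return ctx, cancel
}

// Hypothetical wiring: Shutdown closes an agent-level context that every
// plugin call's request context gets joined with.
type HostVolumeManager struct {
    shutdownCtx    context.Context
    shutdownCancel context.CancelFunc
    // plugin-call dependencies elided
}

func (hvm *HostVolumeManager) Shutdown() { hvm.shutdownCancel() }

func (hvm *HostVolumeManager) createVolume(reqCtx context.Context) error {
    ctx, cancel := joinContext(reqCtx, hvm.shutdownCtx)
    defer cancel()
    // the plugin call receives ctx, so it is cancelled either by the RPC
    // timeout or by agent shutdown, whichever happens first
    _ = ctx
    return nil
}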

Comment on lines +74 to +76
// dynamic volumes, like CSI, have more robust `capabilities`,
// so we always set ReadOnly to false, and let the scheduler
// decide when to ignore this and check capabilities instead.

I think the taskrunner's volume_hook will need to handle this too, but I'll follow up on that later.
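
Very roughly, the check the hook (and the scheduler) would need might look like this; the types and field names below are guesses for illustration, not the actual structs:

// hostVolume mirrors the fingerprinted entry: ID is empty for static volumes
// and set for dynamic ones. volumeRequest stands in for what the jobspec asked
// for. Both are illustrative, not the real types.
type hostVolume struct {
    ID       string
    ReadOnly bool
}

type volumeRequest struct {
    ReadOnly bool
}

// mountReadOnly decides whether a mount should be read-only. For static
// volumes the fingerprinted ReadOnly flag still applies; for dynamic volumes
// that flag is always false, so only the request (validated against the
// volume's capabilities at scheduling time) matters.
func mountReadOnly(vol hostVolume, req volumeRequest) bool {
    if vol.ID == "" { // "classic" static host volume
        return vol.ReadOnly || req.ReadOnly
    }
    return req.ReadOnly
}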

@gulducat gulducat merged commit 1f8c378 into dynamic-host-volumes Dec 9, 2024
18 checks passed
@gulducat gulducat deleted the dhv-volume-fingerprint branch December 9, 2024 20:26
tgross pushed a commit that referenced this pull request Dec 9, 2024
tgross pushed a commit that referenced this pull request Dec 13, 2024
tgross pushed a commit that referenced this pull request Dec 19, 2024