
dynamic host volumes: account for other claims in capability check #24684

Open · wants to merge 2 commits into base: dynamic-host-volumes

Conversation

@tgross (Member) commented Dec 16, 2024

When we feasibility-check a dynamic host volume against a volume request, we check the attachment mode and access mode. This only ensures that the capabilities match, but doesn't enforce the semantics of those capabilities against other claims that may already be made on the volume.

Add support for checking the requested capability against the other allocations that have claimed the volume.

Ref: #24479
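
For readers new to this code path, here is a minimal, self-contained sketch (in Go) of the kind of semantic check described above: given the claims other allocations already hold on the volume, reject a request whose access mode would be violated. All type, constant, and function names below are illustrative assumptions, not Nomad's actual API.

```go
package main

import "fmt"

// Hypothetical access modes; names are assumptions, not Nomad's identifiers.
type AccessMode string

const (
	SingleNodeReader       AccessMode = "single-node-reader-only"
	SingleNodeSingleWriter AccessMode = "single-node-single-writer"
	SingleNodeMultiWriter  AccessMode = "single-node-multi-writer"
)

// Claim represents another allocation's existing claim on the volume.
type Claim struct {
	AllocID  string
	ReadOnly bool
}

// checkCapability rejects a request whose semantics conflict with the claims
// other allocations already hold on the volume, rather than only checking
// that the requested mode appears in the volume's capability list.
func checkCapability(mode AccessMode, readOnly bool, existing []Claim) error {
	writers := 0
	for _, c := range existing {
		if !c.ReadOnly {
			writers++
		}
	}
	switch mode {
	case SingleNodeReader:
		if !readOnly {
			return fmt.Errorf("read-only access mode used by a writable request")
		}
	case SingleNodeSingleWriter:
		if !readOnly && writers > 0 {
			return fmt.Errorf("volume already claimed by %d writer(s)", writers)
		}
	case SingleNodeMultiWriter:
		// any number of writers on the node is allowed
	}
	return nil
}

func main() {
	err := checkCapability(SingleNodeSingleWriter, false, []Claim{{AllocID: "a1", ReadOnly: false}})
	fmt.Println(err) // volume already claimed by 1 writer(s)
}
```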

@tgross added this to the 1.10.0 milestone Dec 16, 2024
@tgross force-pushed the dhv-scheduling-caps-counts branch from 80e3481 to 44e88d5 on December 17, 2024 20:46
tgross added a commit that referenced this pull request Dec 17, 2024
CSI volumes support multi-node access patterns on the same volume ID, but dynamic host volumes by nature do not. The underlying volume may actually be multi-node (ex. NFS), but Nomad is ignorant of this. Remove the CSI-specific multi-node access modes and instead include the single-node access modes that are currently in the alpha edition of the CSI spec, which are better suited for DHV.

This PR has been extracted from #24684 to keep reviews manageable.

Ref: #24479
Ref: #24684
tgross added a commit that referenced this pull request Dec 17, 2024 (same commit message as above)
tgross added a commit that referenced this pull request Dec 18, 2024 (same commit message as above)
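
For context, a rough sketch of what the reduced, single-node-only access mode set might look like. The identifiers below are assumptions modeled on the CSI alpha spec's single-node modes and may not match Nomad's actual constants.

```go
package structs

// HostVolumeAccessMode is a hypothetical enum of the single-node modes kept
// for dynamic host volumes after dropping the CSI multi-node modes.
type HostVolumeAccessMode string

const (
	// one or more readers, no writers
	HostVolumeAccessModeSingleNodeReader HostVolumeAccessMode = "single-node-reader-only"
	// readers and writers allowed
	HostVolumeAccessModeSingleNodeWriter HostVolumeAccessMode = "single-node-writer"
	// at most one writer; readers allowed
	HostVolumeAccessModeSingleNodeSingleWriter HostVolumeAccessMode = "single-node-single-writer"
	// multiple writers allowed
	HostVolumeAccessModeSingleNodeMultiWriter HostVolumeAccessMode = "single-node-multi-writer"
)
```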
@tgross force-pushed the dhv-scheduling-caps-counts branch from 44e88d5 to 447e41a on December 18, 2024 14:21
@tgross marked this pull request as ready for review December 18, 2024 14:21
@tgross requested review from a team as code owners December 18, 2024 14:21
@tgross force-pushed the dhv-scheduling-caps-counts branch from 447e41a to 983591d on December 18, 2024 16:38
@tgross force-pushed the dhv-scheduling-caps-counts branch from 983591d to 366ccbf on December 18, 2024 18:59
@tgross marked this pull request as draft December 18, 2024 19:22
@tgross (Member, Author) commented Dec 18, 2024

Moving back to draft as Daniel has pointed out a fairly serious design flaw in how this works with group.count > 1 on the first deploy. Fixed!

@tgross force-pushed the dhv-scheduling-caps-counts branch from 482a6ea to bbd1e22 on December 18, 2024 21:21
@tgross marked this pull request as ready for review December 18, 2024 21:23
@tgross (Member, Author) commented Dec 18, 2024

We're now using the scheduler context's ProposedAllocs (which requires a trip back to the state store to fetch the Job structs, but that's no big deal), and that lets us drop the previously proposed tracking of the prior alloc ID, writer counts, etc.

In addition to the unit tests, I've run this across deployments of a job with group.count=2 and the single-node-single-writer mode to verify everything is working as we'd expect.
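
A rough, non-authoritative sketch of the shape of that check follows. Walking the node's proposed allocations and looking the Job up in state are described above, but the stub types and function parameters below are simplifications for illustration and don't match Nomad's exact API.

```go
package scheduler

import "fmt"

// Alloc and VolumeRequest are stand-ins for Nomad's allocation and volume
// request structs, trimmed to the fields this sketch needs.
type Alloc struct{ Namespace, JobID, TaskGroup string }

type VolumeRequest struct {
	Source   string
	ReadOnly bool
}

// checkProposedClaims walks the node's proposed allocations and checks each
// matching volume request against the requested capability. The two function
// parameters stand in for the scheduler context's ProposedAllocs and a
// state-store lookup of the Job's task group volumes.
func checkProposedClaims(
	proposedAllocs []*Alloc,
	groupVolumeRequests func(ns, job, group string) []*VolumeRequest,
	volSource string,
	checkClaim func(*VolumeRequest) error,
) error {
	seen := map[string]bool{}
	for _, alloc := range proposedAllocs {
		// all allocs for the same job have the same volume requests, so we
		// only check a given job once (the review below refines this key to
		// include the task group)
		key := alloc.Namespace + "/" + alloc.JobID
		if seen[key] {
			continue
		}
		seen[key] = true
		for _, req := range groupVolumeRequests(alloc.Namespace, alloc.JobID, alloc.TaskGroup) {
			if req.Source != volSource {
				continue
			}
			if err := checkClaim(req); err != nil {
				return fmt.Errorf("volume %q: %w", volSource, err)
			}
		}
	}
	return nil
}
```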

@gulducat (Member) left a comment

lgtm! just one bit of comment clarity

Comment on lines 296 to 298:

```go
// all allocs for the same job will have the same read-only flag
// and capabilities, so we only need to check a given job once
continue
```

I see what this is doing, but I don't follow the reasoning in this comment. I read it as meaning that different groups in the job will all have the same volume params, but this is more about allocs per group, specifically count, right?

Suggested change:

```diff
-// all allocs for the same job will have the same read-only flag
-// and capabilities, so we only need to check a given job once
+// all allocs for the same task group will have the same read-only
+// flag and capabilities, so we only need to check a given job once
 continue
```

using the job's ns+id as the seen key, without any reference to group, threw me a bit, I think. not that you should change that, just that the comment is pretty important to clarify this.

@tgross (Member, Author) replied:

Yeah, very good point. So that's actually a bug here because if a job has multiple allocs for different task groups on the same node, then we'd potentially miss volume requests to check. Ex. a job has alloc A for group A and alloc B for group B on the same node, and only group B has a volume request. If we happen to check alloc A first we'd miss that.

Will fix.
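
For illustration only, the fix described above might look roughly like this, reusing the assumed names from the sketch earlier in this thread:

```go
// Hypothetical fix: key the "seen" set by namespace, job, and task group so
// that a second alloc from a different group of the same job is still checked.
key := alloc.Namespace + "/" + alloc.JobID + "/" + alloc.TaskGroup
if seen[key] {
	// all allocs for the same task group have the same volume request,
	// so each group only needs to be checked once
	continue
}
seen[key] = true
```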
