Add documentation for DirectiveBreakdown #172
Merged
---
authors: Matt Richerson <[email protected]>
categories: provisioning
---

# Directive Breakdown

## Background

The `#DW` directives in a job script are not intended to be interpreted by the workload manager (WLM). The workload manager passes the `#DW` directives to the NNF software through the DWS `Workflow` resource, and the NNF software determines what resources are needed to satisfy the directives. The NNF software communicates this information back to the workload manager through the DWS `DirectiveBreakdown` resource. This document describes how the WLM should interpret the information in the `DirectiveBreakdown`.

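For context, the sketch below shows roughly how a `#DW` directive flows from the WLM to the NNF software through a DWS `Workflow` resource, and how the resulting `DirectiveBreakdown` is referenced back. This is an abbreviated, hypothetical example: the field names and values are approximations rather than an authoritative schema, so consult the DWS API definitions for the real layout.

```yaml
# Hypothetical, abbreviated DWS Workflow resource. Field names are approximate;
# the DWS API definitions are authoritative.
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Workflow
metadata:
  name: example
  namespace: default
spec:
  desiredState: Proposal
  dwDirectives:
  - '#DW jobdw capacity=1GiB type=xfs name=example'
  userID: 7900
status:
  state: Proposal
  ready: true
  # The NNF software creates a DirectiveBreakdown for each directive and
  # records a reference to it here during the Proposal phase.
  directiveBreakdowns:
  - name: example-0
    namespace: default
```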
## DirectiveBreakdown Overview

The DWS `DirectiveBreakdown` contains all the information necessary to inform the WLM how to pick storage and compute nodes for a job. The `DirectiveBreakdown` resource is created by the NNF software during the `Proposal` phase of the DWS workflow. The `spec` section of the `DirectiveBreakdown` is filled in with the `#DW` directive by the NNF software, and the `status` section contains the information for the WLM. The WLM should wait until the `status.ready` field is true before interpreting the rest of the `status` fields.

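As a hedged illustration of why the WLM must wait, a `DirectiveBreakdown` that has not finished the `Proposal` phase might look like the abbreviated sketch below, with `status.ready` false and the rest of the `status` section still missing. The exact fields shown are illustrative assumptions, not an authoritative example.

```yaml
# Illustrative only: a DirectiveBreakdown before the NNF software has finished
# the Proposal phase. The WLM should not act on the status section yet.
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: DirectiveBreakdown
metadata:
  name: example-0
  namespace: default
spec:
  directive: '#DW jobdw capacity=1GiB type=xfs name=example'
  userID: 7900
status:
  ready: false
```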
The contents of the `DirectiveBreakdown` will look different depending on the file system type and options specified by the user. The `status` section contains enough information that the WLM may be able to figure out the underlying file system type requested by the user, but the WLM should not make any decisions based on the file system type. Instead, the WLM should make storage and compute allocation decisions based on the generic information provided in the `DirectiveBreakdown` since the storage and compute allocations needed to satisfy a `#DW` directive may differ based on options other than the file system type.

## Storage Nodes

The `status.storage` section of the `DirectiveBreakdown` describes how the storage allocations should be made and any constraints on the NNF nodes that can be picked. The `status.storage` section will exist only for `jobdw` and `create_persistent` directives. An example of the `status.storage` section is included below.

```yaml
...
spec:
  directive: '#DW jobdw capacity=1GiB type=xfs name=example'
  userID: 7900
status:
  ...
  ready: true
  storage:
    allocationSets:
    - allocationStrategy: AllocatePerCompute
      constraints:
        labels:
        - dataworkflowservices.github.io/storage=Rabbit
      label: xfs
      minimumCapacity: 1073741824
    lifetime: job
    reference:
      kind: Servers
      name: example-0
      namespace: default
...
```
* `status.storage.allocationSets` is a list of storage allocation sets that are needed for the job. An allocation set is a group of individual storage allocations that all have the same parameters and requirements. Depending on the storage type specified by the user, there may be more than one allocation set. Allocation sets should be handled independently.

* `status.storage.allocationSets.allocationStrategy` specifies how the allocations should be made.
    * `AllocatePerCompute` - One allocation is needed per compute node in the job. The size of an individual allocation is specified in `status.storage.allocationSets.minimumCapacity`.
    * `AllocateAcrossServers` - One or more allocations are needed with an aggregate capacity of `status.storage.allocationSets.minimumCapacity`. This allocation strategy does not imply anything about how many allocations to make per NNF node or how many NNF nodes to use. The allocations on each NNF node should be the same size.
    * `AllocateSingleServer` - One allocation is needed with a capacity of `status.storage.allocationSets.minimumCapacity`.

* `status.storage.allocationSets.constraints` is a set of requirements for which NNF nodes can be picked. More information about the different constraint types is provided in the [Storage Constraints](readme.md#storage-constraints) section below.

* `status.storage.allocationSets.label` is an opaque string that the WLM uses when creating the `spec.allocationSets` entry in the DWS `Servers` resource (see the sketch after this list).

* `status.storage.allocationSets.minimumCapacity` is the allocation capacity in bytes. The interpretation of this field depends on the value of `status.storage.allocationSets.allocationStrategy`.

* `status.storage.lifetime` specifies how long the storage allocations will last.
    * `job` - The allocation will last for the lifetime of the job.
    * `persistent` - The allocation will last for longer than the lifetime of the job.

* `status.storage.reference` is an object reference to a DWS `Servers` resource where the WLM can specify allocations (see the sketch after this list).

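To show how the `label`, `minimumCapacity`, and `reference` fields fit together, here is a hedged sketch of the `Servers` resource a WLM might fill in for the `AllocatePerCompute` example above, assuming a 16-compute job spread across two NNF nodes. The field names under `spec.allocationSets` are approximations; the DWS `Servers` API definition is authoritative.

```yaml
# Hypothetical Servers resource as a WLM might fill it in for the xfs example.
# Field names under spec.allocationSets are approximate, not a schema reference.
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Servers
metadata:
  name: example-0       # matches status.storage.reference in the DirectiveBreakdown
  namespace: default
spec:
  allocationSets:
  - label: xfs                    # copied from status.storage.allocationSets.label
    allocationSize: 1073741824    # AllocatePerCompute: minimumCapacity per allocation
    storage:
    - name: rabbit-node-1
      allocationCount: 8          # 8 of the 16 computes are attached to this NNF node
    - name: rabbit-node-2
      allocationCount: 8
```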
### Storage Constraints

Constraints on an allocation set provide additional requirements for how the storage allocations should be made on NNF nodes.

* `labels` specifies a list of labels that must all be present on a DWS `Storage` resource in order for an allocation to exist on that `Storage`.
    ```yaml
    constraints:
      labels:
      - dataworkflowservices.github.io/storage=Rabbit
      - mysite.org/pool=firmware_test
    ```
    For example, the following `Storage` resource satisfies the constraint above because both required labels are present:
    ```yaml
    apiVersion: dataworkflowservices.github.io/v1alpha2
    kind: Storage
    metadata:
      labels:
        dataworkflowservices.github.io/storage: Rabbit
        mysite.org/pool: firmware_test
        mysite.org/drive-speed: fast
      name: rabbit-node-1
      namespace: default
    ...
    ```

* `colocation` specifies how two or more allocations influence the location of each other. The colocation constraint has two fields, `type` and `key`. Currently, the only value for `type` is `exclusive`. `key` can be any value. This constraint means that allocations from an allocation set with a colocation constraint can't be placed on an NNF node that already holds another allocation whose allocation set has a colocation constraint with the same key. Allocations from allocation sets whose colocation constraints have different keys, or from allocation sets without a colocation constraint, may be placed on the same NNF node.
    ```yaml
    constraints:
      colocation:
        type: exclusive
        key: lustre-mgt
    ```

* `count` specifies the number of allocations to make when `status.storage.allocationSets.allocationStrategy` is `AllocateAcrossServers` (a sketch of how this might map onto a `Servers` resource follows these examples).
    ```yaml
    constraints:
      count: 5
    ```

* `scale` is a unitless value from 1-10 that is meant to guide the WLM on how many allocations to make when `status.storage.allocationSets.allocationStrategy` is `AllocateAcrossServers`. The actual number of allocations is not meant to correspond to the value of `scale`. Rather, 1 indicates the minimum number of allocations needed to reach `status.storage.allocationSets.minimumCapacity`, and 10 indicates the maximum number of allocations that makes sense given `status.storage.allocationSets.minimumCapacity` and the compute node count. The NNF software does not interpret this value, and it is up to the WLM to define its meaning.
    ```yaml
    constraints:
      scale: 8
    ```
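As an illustration of how a `count` constraint (or a WLM's interpretation of `scale`) might turn into concrete allocations, below is a hedged sketch of the `spec.allocationSets` portion of a `Servers` resource for an `AllocateAcrossServers` set with `count: 5` and a 1 TiB `minimumCapacity`. The `ost` label and the field names are assumptions for illustration; the actual label comes from the `DirectiveBreakdown`, and the DWS `Servers` API definition is authoritative.

```yaml
# Hypothetical Servers allocation set for AllocateAcrossServers with count=5
# and minimumCapacity=1 TiB. Field names are approximate.
spec:
  allocationSets:
  - label: ost                    # assumed label; taken from the DirectiveBreakdown in practice
    allocationSize: 236223201280  # 220 GiB each; 5 x 220 GiB exceeds the 1 TiB minimum
    storage:
    - name: rabbit-node-1
      allocationCount: 2
    - name: rabbit-node-2
      allocationCount: 2
    - name: rabbit-node-3
      allocationCount: 1
```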
## Compute Nodes

The `status.compute` section of the `DirectiveBreakdown` describes how the WLM should pick compute nodes for a job. The `status.compute` section will exist only for `jobdw` and `persistentdw` directives. An example of the `status.compute` section is included below.

```yaml
...
spec:
  directive: '#DW jobdw capacity=1TiB type=lustre name=example'
  userID: 3450
status:
  ...
  compute:
    constraints:
      location:
      - access:
        - priority: mandatory
          type: network
        - priority: bestEffort
          type: physical
        reference:
          fieldPath: servers.spec.allocationSets[0]
          kind: Servers
          name: example-0
          namespace: default
      - access:
        - priority: mandatory
          type: network
        reference:
          fieldPath: servers.spec.allocationSets[1]
          kind: Servers
          name: example-0
          namespace: default
...
```

The `status.compute.constraints` section lists any constraints on which compute nodes can be used. Currently, the only constraint type is the `location` constraint. `status.compute.constraints.location` is a list of location constraints that must all be satisfied.

A location constraint consists of an `access` list and a `reference`.

* `status.compute.constraints.location.reference` is an object reference with a `fieldPath` that points to an allocation set in the `Servers` resource. If this is from a `#DW jobdw` directive, the `Servers` resource won't be filled in until the WLM picks storage nodes for the allocations.
* `status.compute.constraints.location.access` is a list that specifies what type of access the compute nodes need to have to the storage allocations in the allocation set. An allocation set may have multiple access types that are required (the sketch after this list shows one way a WLM might resolve them).
* `status.compute.constraints.location.access.type` specifies the connection type for the storage. This can be `network` or `physical`.
* `status.compute.constraints.location.access.priority` specifies how necessary the connection type is. This can be `mandatory` or `bestEffort`.

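To make the `physical` access type concrete, below is a hedged sketch of how a WLM might learn which compute nodes are physically attached to a chosen NNF node by reading that node's DWS `Storage` resource. The layout of the `access` section is an approximation; the DWS `Storage` API definition is authoritative.

```yaml
# Hypothetical, abbreviated DWS Storage resource for one NNF node. To satisfy a
# mandatory 'physical' access constraint, the WLM would limit its compute
# selection to nodes listed here. Field names are approximate.
apiVersion: dataworkflowservices.github.io/v1alpha2
kind: Storage
metadata:
  name: rabbit-node-1
  namespace: default
status:
  access:
    protocol: PCIe
    computes:
    - name: compute-01
      status: Ready
    - name: compute-02
      status: Ready
```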
Matt, when does an admin create a storage profile that uses 'scale' rather than 'count'? What is the admin trying to accomplish when they do that? The webhook does not allow both to be specified together.
`count` gives you consistent performance of the file system regardless of how big your job is. `scale` gives you file system performance that scales with the number of computes. I think different workload types will benefit from setting the parameters differently: `count` for both OSTs and MDTs since there's no compute job, so the compute node count will be 1 (or 0 if flux adds support for jobs with no compute resources). I think `scale` will be the more used option, but `count` is good for the situations where you want to do something specific and want full control.
All of that sounds like useful info to pass on to the WLM developer.
Yeah, I think you're right. I think I'll add it to the documentation for the corresponding fields in the NnfStorageProfile, though (from the year-old PR). That's more of the admin/user-facing documentation. That's who would be interested in knowing why you'd want to choose one value over another. The WLM documentation is more about how to take the information NNF provides in the DirectiveBreakdown and use that for scheduling computes and Rabbits.
I'm adding the contents of that old PR to the existing documentation for storage profiles. When that's done, I'll put a link to that section here.