From 025caa32f445b21e7bd889ec497372b076df5598 Mon Sep 17 00:00:00 2001 From: Matt Richerson Date: Fri, 26 Jul 2024 14:05:58 -0500 Subject: [PATCH 1/7] Document job directives and environment variables Signed-off-by: Matt Richerson --- docs/guides/index.md | 1 + docs/guides/user-interactions/readme.md | 182 ++++++++++++++++++++++++ external/nnf-dm | 2 +- mkdocs.yml | 1 + 4 files changed, 185 insertions(+), 1 deletion(-) create mode 100644 docs/guides/user-interactions/readme.md diff --git a/docs/guides/index.md b/docs/guides/index.md index 5d0a83b..96dd22d 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -16,6 +16,7 @@ * [Lustre External MGT](external-mgs/readme.md) * [Global Lustre](global-lustre/readme.md) * [Directive Breakdown](directive-breakdown/readme.md) +* [User Interactions](user-interactions/readme.md) ## NNF User Containers diff --git a/docs/guides/user-interactions/readme.md b/docs/guides/user-interactions/readme.md new file mode 100644 index 0000000..93b7c3c --- /dev/null +++ b/docs/guides/user-interactions/readme.md @@ -0,0 +1,182 @@ +--- +authors: Matt Richerson +categories: provisioning +--- + +# Rabbit User Interactions + +## Overview + +A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form `#DW [command] [command args]`, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes. + +Once the job is running on compute nodes, the application can find access to Rabbit specific resources through a set of environment variables that provide mount and network access information. + +## Directives + +### jobdw + +The `jobdw` directive command tells the Rabbit software to create a file system on the Rabbit hardware for the lifetime of the user's job. At the end of the job, any data that is not moved off of the file system either by the application or through a `copy_out` directive will be lost. Multiple `jobdw` directives can be listed in the same job script. + +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. | +| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` | +| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job | +| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. | +| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. 
This is for users that want to initiate data movement to or from the Rabbit storage from within their application | + +#### Examples + +``` +#DW jobdw type=xfs capacity=10GiB name=scratch +``` + +This directive results in a 10GiB xfs file system created for each compute node in the job using the default storage profile. + +``` +#DW jobdw type=lustre capacity=1TB name=dw-temp profile=high-metadata +``` + +This directive results in a single 1TB Lustre file system being created that can be accessed from all the compute nodes in the job. It is using a storage profile that an admin created to give high Lustre metadata performance. + +``` +#DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload +``` + +This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running. + +### create_persistent + +The `create_persistent` command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single `create_persistent` directive is allowed in a job, and it cannot be in the same job as a `destroy_persistent` directive. + +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node in the job. | +| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` | +| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within the system | +| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. The profile used when creating the persistent storage allocation is the same profile used by jobs that use the persistent storage. | + +#### Examples + +``` +#DW create_persistent type=xfs capacity=100GiB name=scratch +``` + +This directive results in a 100GiB xfs file system created for each compute node in the job using the default storage profile. Since xfs file systems are not network accessible, subsequent jobs that want to use the file system must have the same number of compute nodes, and be scheduled on compute nodes with access to the correct Rabbit nodes. This means the job with the `create_persistent` directive must schedule the desired number of compute nodes even if no application is run on the compute nodes as part of the job. + +``` +#DW create_persistent type=lustre capacity=10TiB name=shared-data profile=read-only +``` + +This directive results in a single 10TiB Lustre file system being created that can be accessed later by any compute nodes in the system. 
Multiple jobs can access a Rabbit Lustre file system at the same time. This job can be scheduled with a single compute node (or zero compute nodes if the WLM allows), without any limitations on compute node counts for subsequent jobs using the persistent Lustre file system. + +### destroy_persistent +The `destroy_persistent` command will delete persistent storage that was allocated by a corresponding `create_persistent`. If the persistent storage is currently in use by a job, then the job containing the `destroy_persistent` command will fail. Only a single `destroy_persistent` directive is allowed in a job, and it cannot be in the same job as a `create_persistent` directive. + +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `name` | Yes | String including numbers and '-' | This is a name for the persistent storage allocation that will be destroyed | + +#### Examples + +``` +#DW destroy_persistent name=shared-data +``` + +This directive will delete the persistent storage allocation with the name `shared-data` + +### persistentdw +The `persistentdw` command makes an existing persistent storage allocation available to a job. The persistent storage must already be created from a `create_persistent` command in a different job script. Multiple `persistentdw` commands can be used in the same job script to request access to multiple persistent allocations. + +Persistent Lustre file systems can be accessed from any compute nodes in the system, and the compute node count for the job can vary as needed. Multiple jobs can access a persistent Lustre file system concurrently if desired. Raw, xfs, and GFS2 file systems can only be accessed by compute nodes that have a physical connection to the Rabbits hosting the storage, and jobs accessing these storage types must have the same compute node count as the job that made the persistent storage. + +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `name` | Yes | String including numbers and '-' | This is a name for the persistent storage that will be accessed | +| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application | + +#### Examples + +``` +#DW persistentdw name=shared-data requires=copy-offload +``` + +This directive will cause the `shared-data` persistent storage allocation to be mounted onto the compute nodes for the job application to use. The copy-offload daemon will be started on the compute nodes so the application can request data movement during the application run. + +### copy_in/copy_out + +The `copy_in` and `copy_out` directives are used to move data to and from the storage allocations on Rabbit nodes. The `copy_in` directive requests that data be moved into the Rabbit file system before application launch, and the `copy_out` directive requests data to be moved off of the Rabbit file system after application exit. This is different from data-movement that is requested through the copy-offload API, which occurs during application runtime. Multiple `copy_in` and `copy_out` directives can be included in the same job script. More information about data movement can be found in the [Data Movement](../data-movement/readme.md) documentation. 
+ +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `source` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. | +| `destination` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. | +| `profile` | No | Profile name | This specifies which profile to use when copying data. Profiles specify the copy command to use, MPI arguments, and how output gets logged. If no profile is specified then the default profile is used. Profiles are created by an admin. | + +#### Examples + +``` +#DW jobdw type=xfs capacity=10GiB name=fast-storage +#DW copy_in source=/lus/backup/johndoe/important_data destination=$DW_JOB_fast_storage/data +``` + +This set of directives creates an xfs file system on the Rabbits for each compute node in the job, and then moves data from `/lus/backup/johndoe/important_data` to each of the xfs file systems. `/lus/backup` must be set up in the Rabbit software as a [Global Lustre](../global-lustre/readme.md) file system by an admin. The copy takes place before the application is launched on the compute nodes. + +``` +#DW persistentdw name=shared-data1 +#DW persistentdw name=shared-data2 + +#DW copy_out source=$DW_PERSISTENT_shared_data1/a destination=$DW_PERSISTENT_shared_data2/a profile=no-xattr +#DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr +``` + +This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the `no-xattr` profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a guaranteed order. + +``` +#DW persistentdw name=shared-data +#DW jobdw type=lustre capacity=1TiB name=fast-storage profile=high-metadata + +#DW copy_in source=/lus/shared/johndoe/shared-libraries destination=$DW_JOB_fast_storage/libraries +#DW copy_in source=$DW_PERSISTENT_shared_data/ destination=$DW_JOB_fast_storage/data + +#DW copy_out source=$DW_JOB_fast_storage/data destination=/lus/backup/johndoe/very_important_data profile=no-xattr +``` + +This set of directives makes use of a persistent storage allocation and a job storage allocation. There are two `copy_in` directives, one that copies data from the global lustre file system to the job allocation, and another that copies data from the persistent allocation to the job allocation. These copies do not occur in a guaranteed order. The `copy_out` directive occurs after the application has exited, and copies data from the Rabbit job storage to a global lustre file system. + +### container + +The `container` directive is used to launch user containers on the Rabbit nodes. The containers have access to `jobdw`, `persistentdw`, or global Lustre storage as specified in the `container` directive. 
More documentation for user containers can be found in the [User Containers](../user-containers/readme.md) guide. Only a single `container` directive is allowed in a job. + +#### Command Arguments +| Argument | Required | Value | Notes | +|----------|----------|-------|-------| +| `name` | Yes | String including numbers and '-' | This is a name for the container instance that is unique within a job | +| `profile` | Yes | Profile name | This specifies which container profile to use. The container profile contains information about which container to run, which file system types to expect, which network ports are needed, and many other options. An admin is responsible for creating the container profiles. | +| `DW_JOB_[expected]` | No | `jobdw` storage allocation `name` | The container profile will list `jobdw` file systems that the container requires. `[expected]` is the name as specified in the container profile | +| `DW_PERSISTENT_[expected]` | No | `persistentdw` storage allocation `name` | The container profile will list `persistentdw` file systems that the container requires. `[expected]` is the name as specified in the container profile | +| `DW_GLOBAL_[expected]` | No | Global lustre path | The container profile will list global Lustre file systems that the container requires. `[expected]` is the name as specified in the container profile | + +#### Examples + +``` +#DW jobdw type=xfs capacity=10GiB name=fast-storage +#DW container name=backup profile=automatic-backup DW_JOB_source=fast-storage DW_GLOBAL_destination=/lus/backup/johndoe +``` + +These directives create an xfs Rabbit job allocation and specify a container that should run on the Rabbit nodes. The container profile specified two file systems that the container needs, `DW_JOB_source` and `DW_GLOBAL_destination`. `DW_JOB_source` requires a `jobdw` file system and `DW_GLOBAL_destination` requires a global Lustre file system. + +## Environment Variables + +The WLM makes a set of environment variables available to the job application running on the compute nodes that provide Rabbit specific information. These environment variables are used to find the mount location of Rabbit file systems and port numbers for user containers. + +| Environment Variable | Value | Notes | +|----------------------|-------|-------| +| `DW_JOB_[name]` | Mount path of a `jobdw` file system | `[name]` is from the `name` argument in the `jobdw` directive. Any `'-'` characters in the `name` will be converted to `'_'` in the environment variable. There will be one of these environment variables per `jobdw` directive in the job. | +| `DW_PERSISTENT_[name]` | Mount path of a `persistentdw` file system | `[name]` is from the `name` argument in the `persistentdw` directive. Any `'-'` characters in the `name` will be converted to `'_'` in the environment variable. There will be one of these environment variables per `persistentdw` directive in the job. | +| `NNF_CONTAINER_PORTS` | Comma separated list of ports | These ports are used together with the IP address of the local Rabbit to communicate with a user container specified by a `container` directive. More information can be found in the [User Containers](../user-containers/readme.md) guide. 
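
As an illustration only, the sketch below shows how a compute-node job script might consume these variables, reusing the `checkpoint` and `shared-data` names from the earlier `jobdw` and `persistentdw` examples. The `RABBIT_IP` placeholder is hypothetical: the WLM does not provide it, and it stands in for however the application learns the local Rabbit's IP address. The `my_app` command and the `curl` endpoint are likewise assumptions used only to show where the values would be plugged in.

```bash
#!/bin/bash
# Hypothetical sketch; assumes "#DW jobdw ... name=checkpoint" and
# "#DW persistentdw name=shared-data" were part of this job.

# Mount paths provided by the WLM. Note that '-' in a directive name
# becomes '_' in the environment variable name (shared-data -> shared_data).
echo "jobdw mount point:        ${DW_JOB_checkpoint}"
echo "persistentdw mount point: ${DW_PERSISTENT_shared_data}"

# Write application output into the Rabbit-backed scratch file system.
my_app --output-dir "${DW_JOB_checkpoint}/run1"

# NNF_CONTAINER_PORTS is a comma separated list; split it into an array.
IFS=',' read -r -a ports <<< "${NNF_CONTAINER_PORTS}"

# RABBIT_IP is a placeholder for the local Rabbit's IP address, which the
# application must obtain by some site-specific means. The HTTP endpoint is
# only an example of contacting a user container on that port.
curl "http://${RABBIT_IP}:${ports[0]}/status"
```

The same pattern applies to any allocation: take the `name` value from the directive, convert any `'-'` characters to `'_'`, and prepend `DW_JOB_` or `DW_PERSISTENT_` to find the environment variable that holds the mount path.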
\ No newline at end of file diff --git a/external/nnf-dm b/external/nnf-dm index acb8bf8..e0c318c 160000 --- a/external/nnf-dm +++ b/external/nnf-dm @@ -1 +1 @@ -Subproject commit acb8bf81636a32a892115e82c7a3f6d445e64b1e +Subproject commit e0c318ca1d61325da1d6677a55de432043a8c1e8 diff --git a/mkdocs.yml b/mkdocs.yml index 208575a..258fec7 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,6 +10,7 @@ nav: - guides/index.md - 'Initial Setup': 'guides/initial-setup/readme.md' - 'Compute Daemons': 'guides/compute-daemons/readme.md' + - 'User Interactions': 'guides/user-interactions/readme.md' - 'Data Movement': 'guides/data-movement/readme.md' - 'Copy Offload API': 'guides/data-movement/copy-offload-api.html' - 'Firmware Upgrade': 'guides/firmware-upgrade/readme.md' From 245a2d04ea426ca55c188ea7d4cf761decf5fd63 Mon Sep 17 00:00:00 2001 From: Matt Richerson Date: Mon, 29 Jul 2024 14:38:14 -0500 Subject: [PATCH 2/7] review comments Signed-off-by: Matt Richerson --- docs/guides/user-interactions/readme.md | 26 ++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/guides/user-interactions/readme.md b/docs/guides/user-interactions/readme.md index 93b7c3c..62e1bb0 100644 --- a/docs/guides/user-interactions/readme.md +++ b/docs/guides/user-interactions/readme.md @@ -11,7 +11,7 @@ A user may include one or more Data Workflow directives in their job script to r Once the job is running on compute nodes, the application can find access to Rabbit specific resources through a set of environment variables that provide mount and network access information. -## Directives +## Commands ### jobdw @@ -23,8 +23,8 @@ The `jobdw` directive command tells the Rabbit software to create a file system | `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. | | `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` | | `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job | -| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. | -| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application | +| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. More information about storage profiles can be found in the [Storage Profiles](../storage-profiles/readme.md) guide. | +| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. 
This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the [Required Daemons](../directive-breakdown/readme.md#requireddaemons) section of the [Directive Breakdown](../directive-breakdown/readme.md) guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand. | #### Examples @@ -44,19 +44,19 @@ This directive results in a single 1TB Lustre file system being created that can #DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload ``` -This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running. +This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running using the Copy Offload API. ### create_persistent -The `create_persistent` command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single `create_persistent` directive is allowed in a job, and it cannot be in the same job as a `destroy_persistent` directive. +The `create_persistent` command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single `create_persistent` directive is allowed in a job, and it cannot be in the same job as a `destroy_persistent` directive. See [persistentdw](readme.md#persistentdw) to utilize the storage in a job. #### Command Arguments | Argument | Required | Value | Notes | |----------|----------|-------|-------| | `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node in the job. | | `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` | -| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within the system | -| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. The profile used when creating the persistent storage allocation is the same profile used by jobs that use the persistent storage. | +| `name` | Yes | Lowercase string including numbers and '-' | This is a name for the storage allocation that is unique within the system | +| `profile` | No | Profile name | This specifies which profile to use when allocating storage. 
Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. The profile used when creating the persistent storage allocation is the same profile used by jobs that use the persistent storage. More information about storage profiles can be found in the [Storage Profiles](../storage-profiles/readme.md) guide.| #### Examples @@ -78,7 +78,7 @@ The `destroy_persistent` command will delete persistent storage that was allocat #### Command Arguments | Argument | Required | Value | Notes | |----------|----------|-------|-------| -| `name` | Yes | String including numbers and '-' | This is a name for the persistent storage allocation that will be destroyed | +| `name` | Yes | Lowercase string including numbers and '-' | This is a name for the persistent storage allocation that will be destroyed | #### Examples @@ -96,8 +96,8 @@ Persistent Lustre file systems can be accessed from any compute nodes in the sys #### Command Arguments | Argument | Required | Value | Notes | |----------|----------|-------|-------| -| `name` | Yes | String including numbers and '-' | This is a name for the persistent storage that will be accessed | -| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application | +| `name` | Yes | Lowercase string including numbers and '-' | This is a name for the persistent storage that will be accessed | +| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the [Required Daemons](../directive-breakdown/readme.md#requireddaemons) section of the [Directive Breakdown](../directive-breakdown/readme.md) guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand. | #### Examples @@ -135,7 +135,7 @@ This set of directives creates an xfs file system on the Rabbits for each comput #DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr ``` -This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the `no-xattr` profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a guaranteed order. +This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the `no-xattr` profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a deterministic order. ``` #DW persistentdw name=shared-data @@ -147,7 +147,7 @@ This set of directives copies two directories from one persistent storage alloca #DW copy_out source=$DW_JOB_fast_storage/data destination=/lus/backup/johndoe/very_important_data profile=no-xattr ``` -This set of directives makes use of a persistent storage allocation and a job storage allocation. There are two `copy_in` directives, one that copies data from the global lustre file system to the job allocation, and another that copies data from the persistent allocation to the job allocation. 
These copies do not occur in a guaranteed order. The `copy_out` directive occurs after the application has exited, and copies data from the Rabbit job storage to a global lustre file system. +This set of directives makes use of a persistent storage allocation and a job storage allocation. There are two `copy_in` directives, one that copies data from the global lustre file system to the job allocation, and another that copies data from the persistent allocation to the job allocation. These copies do not occur in a deterministic order. The `copy_out` directive occurs after the application has exited, and copies data from the Rabbit job storage to a global lustre file system. ### container @@ -156,7 +156,7 @@ The `container` directive is used to launch user containers on the Rabbit nodes. #### Command Arguments | Argument | Required | Value | Notes | |----------|----------|-------|-------| -| `name` | Yes | String including numbers and '-' | This is a name for the container instance that is unique within a job | +| `name` | Yes | Lowercase string including numbers and '-' | This is a name for the container instance that is unique within a job | | `profile` | Yes | Profile name | This specifies which container profile to use. The container profile contains information about which container to run, which file system types to expect, which network ports are needed, and many other options. An admin is responsible for creating the container profiles. | | `DW_JOB_[expected]` | No | `jobdw` storage allocation `name` | The container profile will list `jobdw` file systems that the container requires. `[expected]` is the name as specified in the container profile | | `DW_PERSISTENT_[expected]` | No | `persistentdw` storage allocation `name` | The container profile will list `persistentdw` file systems that the container requires. `[expected]` is the name as specified in the container profile | From 6fb9eec6295db84a72fddb57561287191ca9f99d Mon Sep 17 00:00:00 2001 From: Dean Roehrich Date: Thu, 1 Aug 2024 15:08:29 -0500 Subject: [PATCH 3/7] Relationship betw cray.nnf.node.drain taint and Storage resource. (#188) Signed-off-by: Dean Roehrich Co-authored-by: Blake Devcich <89158881+bdevcich@users.noreply.github.com> --- docs/guides/index.md | 2 +- docs/guides/node-management/drain.md | 66 ++++++++++++++++++++++++++-- mkdocs.yml | 2 +- 3 files changed, 64 insertions(+), 6 deletions(-) diff --git a/docs/guides/index.md b/docs/guides/index.md index 96dd22d..768d483 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -24,5 +24,5 @@ ## Node Management -* [Draining A Node](node-management/drain.md) +* [Disable or Drain a Node](node-management/drain.md) * [Debugging NVMe Namespaces](node-management/nvme-namespaces.md) diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md index 9256415..8c00a7c 100644 --- a/docs/guides/node-management/drain.md +++ b/docs/guides/node-management/drain.md @@ -1,4 +1,40 @@ -# Draining A Node +# Disable Or Drain A Node + +## Disabling a node + +A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM. + +Disable a node by setting its Storage state to `Disabled`. + +```shell +kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Disabled"}]' +``` + +When the Storage is queried by the WLM, it will show the disabled status. 
+ +```console +$ kubectl get storages +NAME STATE STATUS MODE AGE +kind-worker2 Enabled Ready Live 10m +kind-worker3 Disabled Disabled Live 10m +``` + +To re-enable a node, set its Storage state to `Enabled`. + +```shell +kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Enabled"}]' +``` + +The Storage state will show that it is enabled. + +```console +kubectl get storages +NAME STATE STATUS MODE AGE +kind-worker2 Enabled Ready Live 10m +kind-worker3 Enabled Ready Live 10m +``` + +## Draining a node The NNF software consists of a collection of DaemonSets and Deployments. The pods on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain` @@ -9,7 +45,11 @@ Given the limitations of DaemonSets, the NNF software will be drained by using t as described in [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/). -## Drain NNF Pods From A Rabbit Node +This would be used only after the WLM jobs have been removed from that Rabbit (preferably) and there is some reason to also remove the NNF software from it. This might be used before a Rabbit is powered off and pulled out of the cabinet, for example, to avoid leaving pods in "Terminating" state (harmless, but it's noise). + +If an admin used this taint before power-off it would mean there wouldn't be "Terminating" pods lying around for that Rabbit. After a new/same Rabbit is put back in its place, the NNF software won't jump back on it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new/same Rabbit is powered back on. + +### Drain NNF pods from a rabbit node Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint. The CSI driver pods will remain on the node to satisfy any unmount requests from k8s @@ -19,15 +59,33 @@ as it cleans up the NNF pods. kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute ``` +This will cause the node's `Storage` resource to be drained: + +```console +$ kubectl get storages +NAME STATE STATUS MODE AGE +kind-worker2 Enabled Drained Live 5m44s +kind-worker3 Enabled Ready Live 5m45s +``` + +The `Storage` resource will contain the following message indicating the reason it has been drained: + +```console +$ kubectl get storages rabbit1 -o json | jq -rM .status.message +Kubernetes node is tainted with cray.nnf.node.drain +``` + To restore the node to service, remove the `cray.nnf.node.drain` taint. ```shell kubectl taint node $NODE cray.nnf.node.drain- ``` -## The CSI Driver +The `Storage` resource will revert to a `Ready` status. + +### The CSI driver -While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so. +While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so. **Warning** K8s relies on the CSI driver to unmount any filesystems that may have been mounted into a pod's namespace. 
If it is not present when k8s is attempting diff --git a/mkdocs.yml b/mkdocs.yml index 258fec7..6e0535c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -20,7 +20,7 @@ nav: - 'User Containers': 'guides/user-containers/readme.md' - 'Lustre External MGT': 'guides/external-mgs/readme.md' - 'Global Lustre': 'guides/global-lustre/readme.md' - - 'Draining A Node': 'guides/node-management/drain.md' + - 'Disable or Drain a Node': 'guides/node-management/drain.md' - 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md' - 'Directive Breakdown': 'guides/directive-breakdown/readme.md' - 'RFCs': From 46f93448237a7cc45bc0ddc313a3fb8806f9c03c Mon Sep 17 00:00:00 2001 From: Matt Richerson Date: Fri, 2 Aug 2024 13:20:02 -0500 Subject: [PATCH 4/7] Add NnfLustreMGT documentation Document how to create the NnfLustreMGT and ConfigMap for MGTs outside of NNF's control. Signed-off-by: Matt Richerson --- docs/guides/external-mgs/readme.md | 55 +++++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/docs/guides/external-mgs/readme.md b/docs/guides/external-mgs/readme.md index cc8a553..c058c0d 100644 --- a/docs/guides/external-mgs/readme.md +++ b/docs/guides/external-mgs/readme.md @@ -17,6 +17,7 @@ These three methods are not mutually exclusive on the system as a whole. Individ ## Configuration with an External MGT +### Storage Profile An existing MGT external to the NNF cluster can be used to manage the Lustre file systems on the NNF nodes. An advantage to this configuration is that the MGT can be highly available through multiple MGSs. A disadvantage is that there is only a single MGT. An MGT shared between more than a handful of Lustre file systems is not a common use case, so the Lustre code may prove less stable. The following yaml provides an example of what the `NnfStorageProfile` should contain to use an MGT on an external server. @@ -30,12 +31,64 @@ metadata: data: [...] lustreStorage: - externalMgs: 1.2.3.4@eth0 + externalMgs: 1.2.3.4@eth0:1.2.3.5@eth0 combinedMgtMdt: false standaloneMgtPoolName: "" [...] ``` +### NnfLustreMGT +A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. The NNF software also erases any no longer used fsnames from disk for any internally hosted MGTs. For an MGT hosted on an external node, an admin must create an `NnfLustreMGT`. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use. + +Below is an example `NnfLustreMGT` resource. The `NnfLustreMGT` resource for external MGSs should be created in the `nnf-system` namespace. + +```yaml +apiVersion: nnf.cray.hpe.com/v1alpha1 +kind: NnfLustreMGT +metadata: + name: external-mgt + namespace: nnf-system +spec: + addresses: + - "1.2.3.4@eth0:1.2.3.5@eth0" + fsNameStart: "aaaaaaaa" + fsNameBlackList: + - "mylustre" + fsNameStartReference: + name: external-mgt + namespace: default + kind: ConfigMap +``` + +* `addresses` - This is a list of LNet addresses that could be used for this MGT. 
This should match any values that are used in the `externalMgs` field in the `NnfStorageProfiles`. +* `fsNameStart` - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g, `aaaaaaaa`, `aaaaaaab`, `aaaaaaac`). fsnames use lowercase letters `'a'`-`'z'`. +* `fsNameBlackList` - This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist. +* `fsNameStartReference` - This is an optional ObjectReference to a `ConfigMap` that holds a starting fsname. If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname everytime an fsname is assigned to a new Lustre file system. + +### ConfigMap + +For external MGTs, the `fsNameStartReference` should be used to point to a `ConfigMap` in the default namespace. The `ConfigMap` should not be removed during an argocd undeploy/deploy. This allows the nnf-sos sofware to be undeployed (including any `NnfLustreMGT` resources), without having the fsname reset back to the `fsNameStart` value on a redeploy. The Configmap that is created should be left empty initially. + +### Argocd + +* An empty ConfigMap should be deployed with the `0-early-config` application. It should be created in the `default` namespace, and it can have any name. +* The argocd application for `0-early-config` should be updated to include the following under `ignoreDifferences`: +```yaml + - kind: ConfigMap + jsonPointers: + - /data +``` +* A yaml file for the `NnfLustreMGT` resource should be deployed with the `2-nnf-sos` application. It should be created in the `nnf-system` namespace, and it can have any name. The `ConfigMap` should be listed in the `fsNameStartReference` field. +* The argocd application for `2-nnf-sos` should be updated to include the following under `ignoreDifferences`: +```yaml + - group: nnf.cray.hpe.com + kind: NnfLustreMGT + jsonPointers: + - /spec/claimList +``` + +A separate `ConfigMap` and `NnfLustreMGT` is needed for every external Lustre MGT. + ## Configuration with Persistent Lustre The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster. From 8961d69079d816b58a56be6855f3ee7347ff07ab Mon Sep 17 00:00:00 2001 From: Matt Richerson Date: Fri, 2 Aug 2024 14:54:18 -0500 Subject: [PATCH 5/7] review comments Signed-off-by: Matt Richerson --- docs/guides/external-mgs/readme.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/guides/external-mgs/readme.md b/docs/guides/external-mgs/readme.md index c058c0d..42e41bb 100644 --- a/docs/guides/external-mgs/readme.md +++ b/docs/guides/external-mgs/readme.md @@ -38,7 +38,8 @@ data: ``` ### NnfLustreMGT -A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. 
The NNF software also erases any no longer used fsnames from disk for any internally hosted MGTs. For an MGT hosted on an external node, an admin must create an `NnfLustreMGT`. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use. + +A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. The NNF software also erases any unused fsnames from the MGT disk for any internally hosted MGTs. For a MGT hosted on an external node, an admin must create an `NnfLustreMGT` resource. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use. Below is an example `NnfLustreMGT` resource. The `NnfLustreMGT` resource for external MGSs should be created in the `nnf-system` namespace. @@ -61,9 +62,9 @@ spec: ``` * `addresses` - This is a list of LNet addresses that could be used for this MGT. This should match any values that are used in the `externalMgs` field in the `NnfStorageProfiles`. -* `fsNameStart` - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g, `aaaaaaaa`, `aaaaaaab`, `aaaaaaac`). fsnames use lowercase letters `'a'`-`'z'`. +* `fsNameStart` - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g, `aaaaaaaa`, `aaaaaaab`, `aaaaaaac`). fsnames use lowercase letters `'a'`-`'z'`. `fsNameStart` should be exactly 8 characters long. * `fsNameBlackList` - This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist. -* `fsNameStartReference` - This is an optional ObjectReference to a `ConfigMap` that holds a starting fsname. If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname everytime an fsname is assigned to a new Lustre file system. +* `fsNameStartReference` - This is an optional `ObjectReference` to a `ConfigMap` that holds a starting fsname. If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname everytime an fsname is assigned to a new Lustre file system. 
### ConfigMap From d6851627e0cbff6dca82568d6dffd78e9a14f347 Mon Sep 17 00:00:00 2001 From: Matt Richerson Date: Tue, 20 Aug 2024 11:51:38 -0500 Subject: [PATCH 6/7] review comments Signed-off-by: Matt Richerson --- docs/guides/external-mgs/readme.md | 30 +++++++----------------------- 1 file changed, 7 insertions(+), 23 deletions(-) diff --git a/docs/guides/external-mgs/readme.md b/docs/guides/external-mgs/readme.md index 42e41bb..b8e0074 100644 --- a/docs/guides/external-mgs/readme.md +++ b/docs/guides/external-mgs/readme.md @@ -39,9 +39,11 @@ data: ### NnfLustreMGT -A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. The NNF software also erases any unused fsnames from the MGT disk for any internally hosted MGTs. For a MGT hosted on an external node, an admin must create an `NnfLustreMGT` resource. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use. +A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. The NNF software also erases any unused fsnames from the MGT disk for any internally hosted MGTs. -Below is an example `NnfLustreMGT` resource. The `NnfLustreMGT` resource for external MGSs should be created in the `nnf-system` namespace. +For a MGT hosted on an external node, an admin must create an `NnfLustreMGT` resource. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use. + +Below is an example `NnfLustreMGT` resource. The `NnfLustreMGT` resource for external MGSs must be created in the `nnf-system` namespace. ```yaml apiVersion: nnf.cray.hpe.com/v1alpha1 @@ -64,31 +66,13 @@ spec: * `addresses` - This is a list of LNet addresses that could be used for this MGT. This should match any values that are used in the `externalMgs` field in the `NnfStorageProfiles`. * `fsNameStart` - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g, `aaaaaaaa`, `aaaaaaab`, `aaaaaaac`). fsnames use lowercase letters `'a'`-`'z'`. `fsNameStart` should be exactly 8 characters long. * `fsNameBlackList` - This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist. -* `fsNameStartReference` - This is an optional `ObjectReference` to a `ConfigMap` that holds a starting fsname. 
If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname everytime an fsname is assigned to a new Lustre file system. +* `fsNameStartReference` - This is an optional `ObjectReference` to a `ConfigMap` that holds a starting fsname. If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname every time an fsname is assigned to a new Lustre file system. ### ConfigMap -For external MGTs, the `fsNameStartReference` should be used to point to a `ConfigMap` in the default namespace. The `ConfigMap` should not be removed during an argocd undeploy/deploy. This allows the nnf-sos sofware to be undeployed (including any `NnfLustreMGT` resources), without having the fsname reset back to the `fsNameStart` value on a redeploy. The Configmap that is created should be left empty initially. - -### Argocd - -* An empty ConfigMap should be deployed with the `0-early-config` application. It should be created in the `default` namespace, and it can have any name. -* The argocd application for `0-early-config` should be updated to include the following under `ignoreDifferences`: -```yaml - - kind: ConfigMap - jsonPointers: - - /data -``` -* A yaml file for the `NnfLustreMGT` resource should be deployed with the `2-nnf-sos` application. It should be created in the `nnf-system` namespace, and it can have any name. The `ConfigMap` should be listed in the `fsNameStartReference` field. -* The argocd application for `2-nnf-sos` should be updated to include the following under `ignoreDifferences`: -```yaml - - group: nnf.cray.hpe.com - kind: NnfLustreMGT - jsonPointers: - - /spec/claimList -``` +For external MGTs, the `fsNameStartReference` should be used to point to a `ConfigMap` in the `default` namespace. The `ConfigMap` should be left empty initially. The `ConfigMap` is used to hold the value of the next available fsname, and it should not be deleted or modified while a `NnfLustreMGT` resource is referencing it. Removing the `ConfigMap` will cause the Rabbit software to lose track of which fsnames have already been used on the MGT. This is undesireable unless the external MGT is no longer being used by Rabbit software or if an admin has erased all previously used fsnames with the `lctl erase_lcfg [fsname]` command. -A separate `ConfigMap` and `NnfLustreMGT` is needed for every external Lustre MGT. +When using the `ConfigMap`, the nnf-sos software may be undeployed and redeployed without losing track of the next fsname value. During an undeploy, the `NnfLustreMGT` resource will be removed. During a deploy, the `NnfLustreMGT` resource will read the fsname value from the `ConfigMap` if it is present. The value in the `ConfigMap` will override the fsname in the `fsNameStart` field. ## Configuration with Persistent Lustre From b07cbf2acf649de95f739ad8b267c10e61e11ab2 Mon Sep 17 00:00:00 2001 From: Anthony Floeder Date: Wed, 21 Aug 2024 14:37:09 -0500 Subject: [PATCH 7/7] update submodule Signed-off-by: Anthony Floeder --- external/nnf-dm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/external/nnf-dm b/external/nnf-dm index e0c318c..45a98ea 160000 --- a/external/nnf-dm +++ b/external/nnf-dm @@ -1 +1 @@ -Subproject commit e0c318ca1d61325da1d6677a55de432043a8c1e8 +Subproject commit 45a98ea47bf84c0d2e1c2408f600557764c27fc2