diff --git a/dev/404.html b/dev/404.html
index 751532b..233ecbc 100644
--- a/dev/404.html
+++ b/dev/404.html
@@ -14,7 +14,7 @@
-
+
diff --git a/dev/guides/compute-daemons/readme/index.html b/dev/guides/compute-daemons/readme/index.html
index 04f755c..0a81ceb 100644
--- a/dev/guides/compute-daemons/readme/index.html
+++ b/dev/guides/compute-daemons/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/data-movement/readme/index.html b/dev/guides/data-movement/readme/index.html
index 952e6be..f8677b5 100644
--- a/dev/guides/data-movement/readme/index.html
+++ b/dev/guides/data-movement/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/directive-breakdown/readme/index.html b/dev/guides/directive-breakdown/readme/index.html
index b01bce4..e56cf76 100644
--- a/dev/guides/directive-breakdown/readme/index.html
+++ b/dev/guides/directive-breakdown/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/external-mgs/readme/index.html b/dev/guides/external-mgs/readme/index.html
index ae29e6b..12f11bd 100644
--- a/dev/guides/external-mgs/readme/index.html
+++ b/dev/guides/external-mgs/readme/index.html
@@ -18,7 +18,7 @@
-
+
@@ -660,6 +660,48 @@
-
+
+
These three methods are not mutually exclusive on the system as a whole. Individual file systems can use any of options 1-3 or create their own MGT.
An existing MGT external to the NNF cluster can be used to manage the Lustre file systems on the NNF nodes. An advantage to this configuration is that the MGT can be highly available through multiple MGSs. A disadvantage is that there is only a single MGT. An MGT shared between more than a handful of Lustre file systems is not a common use case, so the Lustre code may prove less stable.
The following yaml provides an example of what the NnfStorageProfile should contain to use an MGT on an external server.
apiVersion: nnf.cray.hpe.com/v1alpha1
@@ -1041,92 +1126,136 @@ Configuration with an External MGT
data:
[...]
lustreStorage:
- externalMgs: 1.2.3.4@eth0
+ externalMgs: 1.2.3.4@eth0:1.2.3.5@eth0
combinedMgtMdt: false
standaloneMgtPoolName: ""
[...]
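Once a profile like this exists, a job can select it the same way as any other storage profile. An illustrative directive (the job name and capacity here are arbitrary):
#DW jobdw name=example capacity=100GiB type=lustre profile=external-mgt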
The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster.
-To configure a persistent Lustre file system that can share its MGT, a NnfStorageProfile should be used that does not specify externalMgs. The MGT can either share a volume with the MDT or not (combinedMgtMdt).
An NnfLustreMGT resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a NnfLustreMGT resource. Every MGT must have a corresponding NnfLustreMGT resource. For MGTs that are hosted on NNF hardware, the NnfLustreMGT resources are created automatically. The NNF software also erases any fsnames that are no longer in use from the disk of any internally hosted MGT. For an MGT hosted on an external node, an admin must create an NnfLustreMGT resource. This resource ensures that fsnames are created in sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it is not erased from the MGT disk. An admin may decide to periodically run the lctl erase_lcfg [fsname] command to remove fsnames that are no longer in use.
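For example, assuming aaaaaaab is an fsname known to be unused, the cleanup run on the external MGS might look like this sketch:
lctl erase_lcfg aaaaaaab   # erase the configuration logs for the unused fsname from the MGT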
Below is an example NnfLustreMGT resource. The NnfLustreMGT resource for external MGSs should be created in the nnf-system namespace.
apiVersion: nnf.cray.hpe.com/v1alpha1
-kind: NnfStorageProfile
+kind: NnfLustreMGT
metadata:
- name: persistent-lustre-shared-mgt
+ name: external-mgt
namespace: nnf-system
-data:
-[...]
- lustreStorage:
- externalMgs: ""
- combinedMgtMdt: false
- standaloneMgtPoolName: ""
-[...]
-
The persistent storage is created with the following DW directive:
-#DW create_persistent name=shared-lustre capacity=100GiB type=lustre profile=persistent-lustre-shared-mgt
-
After the persistent Lustre file system is created, an admin can discover the MGS address by looking at the NnfStorage resource with the same name as the persistent storage that was created (shared-lustre in the above example).
apiVersion: nnf.cray.hpe.com/v1alpha1
-kind: NnfStorage
-metadata:
- name: shared-lustre
- namespace: default
-[...]
-status:
- mgsNode: 5.6.7.8@eth1
-[...]
+spec:
+ addresses:
+ - "1.2.3.4@eth0:1.2.3.5@eth0"
+ fsNameStart: "aaaaaaaa"
+ fsNameBlackList:
+ - "mylustre"
+ fsNameStartReference:
+ name: external-mgt
+ namespace: default
+ kind: ConfigMap
A separate NnfStorageProfile can be created that specifies the MGS address.
addresses - This is a list of LNet addresses that could be used for this MGT. This should match any values that are used in the externalMgs field in the NnfStorageProfiles.
fsNameStart - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g., aaaaaaaa, aaaaaaab, aaaaaaac). fsnames use lowercase letters 'a'-'z'.
fsNameBlackList - This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist.
fsNameStartReference - This is an optional ObjectReference to a ConfigMap that holds a starting fsname. If this field is specified, it takes precedence over the fsNameStart field in the spec. The ConfigMap will be updated to the next available fsname every time an fsname is assigned to a new Lustre file system.
For external MGTs, the fsNameStartReference should be used to point to a ConfigMap in the default namespace. The ConfigMap should not be removed during an argocd undeploy/deploy. This allows the nnf-sos software to be undeployed (including any NnfLustreMGT resources) without having the fsname reset back to the fsNameStart value on a redeploy. The ConfigMap that is created should be left empty initially.
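A minimal sketch of that ConfigMap, assuming the name and namespace used by the fsNameStartReference in the example above (it is created with no data; the NNF software fills in the next fsname):
apiVersion: v1
kind: ConfigMap
metadata:
  name: external-mgt
  namespace: default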
The ConfigMap should be deployed with the 0-early-config application. 0-early-config should be updated to include the following under ignoreDifferences (see the sketch after this list):
+The NnfLustreMGT resource should be deployed with the 2-nnf-sos application. It should be created in the nnf-system namespace, and it can have any name. The ConfigMap should be listed in the fsNameStartReference field. 2-nnf-sos should be updated to include the following under ignoreDifferences (also sketched below):
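The yaml for those two ignoreDifferences entries would use standard Argo CD Application syntax; the following is only a sketch, where the ConfigMap name and the NnfLustreMGT jsonPointer are assumptions to adapt to whatever fields nnf-sos actually updates at runtime:
# In the 0-early-config Application spec:
ignoreDifferences:
- group: ""
  kind: ConfigMap
  name: external-mgt        # assumed: the ConfigMap referenced by fsNameStartReference
  namespace: default
  jsonPointers:
  - /data                   # nnf-sos records the next available fsname here
# In the 2-nnf-sos Application spec:
ignoreDifferences:
- group: nnf.cray.hpe.com
  kind: NnfLustreMGT
  jsonPointers:
  - /spec/claimList         # hypothetical pointer; substitute the spec fields updated at runtime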
+A separate ConfigMap and NnfLustreMGT is needed for every external Lustre MGT.
The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster.
+To configure a persistent Lustre file system that can share its MGT, a NnfStorageProfile should be used that does not specify externalMgs. The MGT can either share a volume with the MDT or not (combinedMgtMdt).
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
- name: internal-mgt
+ name: persistent-lustre-shared-mgt
namespace: nnf-system
data:
[...]
lustreStorage:
- externalMgs: 5.6.7.8@eth1
+ externalMgs: ""
combinedMgtMdt: false
standaloneMgtPoolName: ""
[...]
With this configuration, an admin must determine that no file systems are using the shared MGT before destroying the persistent Lustre instance.
-Another method NNF supports is to create a number of persistent Lustre MGTs on NNF nodes. These MGTs are not part of a full file system, but are instead added to a pool of MGTs available for other Lustre file systems to use. Lustre file systems that are created will choose one of the MGTs at random to use and add a reference to make sure it isn't destroyed. This configuration has the advantage of spreading the Lustre management load across multiple servers. The disadvantage of this configuration is that it does not provide high availability.
-To configure the system this way, the first step is to make a pool of Lustre MGTs. This is done by creating a persistent instance from a storage profile that specifies the standaloneMgtPoolName option. This option tells NNF software to only create an MGT, and to add it to a named pool. The following NnfStorageProfile provides an example where the MGT is added to the example-pool pool:
apiVersion: nnf.cray.hpe.com/v1alpha1
-kind: NnfStorageProfile
-metadata:
- name: mgt-pool-member
- namespace: nnf-system
-data:
-[...]
- lustreStorage:
- externalMgs: ""
- combinedMgtMdt: false
- standaloneMgtPoolName: "example-pool"
-[...]
+The persistent storage is created with the following DW directive:
+#DW create_persistent name=shared-lustre capacity=100GiB type=lustre profile=persistent-lustre-shared-mgt
-A persistent storage MGTs can be created with the following DW directive:
-#DW create_persistent name=mgt-pool-member-1 capacity=1GiB type=lustre profile=mgt-pool-member
+After the persistent Lustre file system is created, an admin can discover the MGS address by looking at the NnfStorage resource with the same name as the persistent storage that was created (shared-lustre in the above example).
+apiVersion: nnf.cray.hpe.com/v1alpha1
+kind: NnfStorage
+metadata:
+ name: shared-lustre
+ namespace: default
+[...]
+status:
+ mgsNode: 5.6.7.8@eth1
+[...]
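The address can also be pulled directly with kubectl; a one-line sketch, assuming the resource name and namespace shown above:
kubectl get nnfstorages.nnf.cray.hpe.com shared-lustre -n default -o jsonpath='{.status.mgsNode}'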
-Multiple persistent instances with different names can be created using the mgt-pool-member profile to add more than one MGT to the pool.
-To create a Lustre file system that uses one of the MGTs from the pool, an NnfStorageProfile should be created that uses the special notation pool:[pool-name] in the externalMgs field.
+A separate NnfStorageProfile can be created that specifies the MGS address.
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
- name: mgt-pool-consumer
+ name: internal-mgt
namespace: nnf-system
data:
[...]
lustreStorage:
- externalMgs: "pool:example-pool"
+ externalMgs: 5.6.7.8@eth1
combinedMgtMdt: false
standaloneMgtPoolName: ""
[...]
+With this configuration, an admin must determine that no file systems are using the shared MGT before destroying the persistent Lustre instance.
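One rough way to make that check is to search all NnfStorage resources for the shared MGS address (5.6.7.8@eth1 in the example above); this is only a sketch, and it assumes any match other than the persistent instance itself is a consumer of the MGT:
kubectl get nnfstorages.nnf.cray.hpe.com -A -o yaml | grep -B 10 '5.6.7.8@eth1'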
+Configuration with an Internal MGT Pool
+Another method NNF supports is to create a number of persistent Lustre MGTs on NNF nodes. These MGTs are not part of a full file system, but are instead added to a pool of MGTs available for other Lustre file systems to use. Lustre file systems that are created will choose one of the MGTs at random to use and add a reference to make sure it isn't destroyed. This configuration has the advantage of spreading the Lustre management load across multiple servers. The disadvantage of this configuration is that it does not provide high availability.
+To configure the system this way, the first step is to make a pool of Lustre MGTs. This is done by creating a persistent instance from a storage profile that specifies the standaloneMgtPoolName option. This option tells NNF software to only create an MGT, and to add it to a named pool. The following NnfStorageProfile provides an example where the MGT is added to the example-pool pool:
+apiVersion: nnf.cray.hpe.com/v1alpha1
+kind: NnfStorageProfile
+metadata:
+ name: mgt-pool-member
+ namespace: nnf-system
+data:
+[...]
+ lustreStorage:
+ externalMgs: ""
+ combinedMgtMdt: false
+ standaloneMgtPoolName: "example-pool"
+[...]
+
+Persistent storage MGTs can be created with the following DW directive:
+#DW create_persistent name=mgt-pool-member-1 capacity=1GiB type=lustre profile=mgt-pool-member
+Multiple persistent instances with different names can be created using the mgt-pool-member profile to add more than one MGT to the pool.
+To create a Lustre file system that uses one of the MGTs from the pool, an NnfStorageProfile should be created that uses the special notation pool:[pool-name] in the externalMgs field.
+apiVersion: nnf.cray.hpe.com/v1alpha1
+kind: NnfStorageProfile
+metadata:
+ name: mgt-pool-consumer
+ namespace: nnf-system
+data:
+[...]
+ lustreStorage:
+ externalMgs: "pool:example-pool"
+ combinedMgtMdt: false
+ standaloneMgtPoolName: ""
+[...]
+
The following provides an example DW directive that uses an MGT from the MGT pool:
-#DW jobdw name=example-lustre capacity=100GiB type=lustre profile=mgt-pool-consumer
+#DW jobdw name=example-lustre capacity=100GiB type=lustre profile=mgt-pool-consumer
MGT pools are named, so there can be separate pools with collections of different MGTs in them. A storage profile targeting each pool would be needed.
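As an illustration of that point (the profile and pool names below are invented), a second pool is simply another member/consumer pair of profiles with a different pool name:
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: mgt-pool2-member
  namespace: nnf-system
data:
[...]
  lustreStorage:
    externalMgs: ""
    combinedMgtMdt: false
    standaloneMgtPoolName: "example-pool2"
[...]
---
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: mgt-pool2-consumer
  namespace: nnf-system
data:
[...]
  lustreStorage:
    externalMgs: "pool:example-pool2"
    combinedMgtMdt: false
    standaloneMgtPoolName: ""
[...]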
diff --git a/dev/guides/firmware-upgrade/readme/index.html b/dev/guides/firmware-upgrade/readme/index.html
index 7f30da2..24e48b3 100644
--- a/dev/guides/firmware-upgrade/readme/index.html
+++ b/dev/guides/firmware-upgrade/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/global-lustre/readme/index.html b/dev/guides/global-lustre/readme/index.html
index 6db8052..4218297 100644
--- a/dev/guides/global-lustre/readme/index.html
+++ b/dev/guides/global-lustre/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/ha-cluster/notes/index.html b/dev/guides/ha-cluster/notes/index.html
index 429ff50..3882b11 100644
--- a/dev/guides/ha-cluster/notes/index.html
+++ b/dev/guides/ha-cluster/notes/index.html
@@ -14,7 +14,7 @@
-
+
diff --git a/dev/guides/ha-cluster/readme/index.html b/dev/guides/ha-cluster/readme/index.html
index 5c995f2..d6e1753 100644
--- a/dev/guides/ha-cluster/readme/index.html
+++ b/dev/guides/ha-cluster/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/index.html b/dev/guides/index.html
index 97404dc..dbcd116 100644
--- a/dev/guides/index.html
+++ b/dev/guides/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/initial-setup/readme/index.html b/dev/guides/initial-setup/readme/index.html
index ec4b37e..342e545 100644
--- a/dev/guides/initial-setup/readme/index.html
+++ b/dev/guides/initial-setup/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/node-management/drain/index.html b/dev/guides/node-management/drain/index.html
index 5d143bc..927814a 100644
--- a/dev/guides/node-management/drain/index.html
+++ b/dev/guides/node-management/drain/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/node-management/nvme-namespaces/index.html b/dev/guides/node-management/nvme-namespaces/index.html
index d96851d..0eef607 100644
--- a/dev/guides/node-management/nvme-namespaces/index.html
+++ b/dev/guides/node-management/nvme-namespaces/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/rbac-for-users/readme/index.html b/dev/guides/rbac-for-users/readme/index.html
index 97dbc60..536a07d 100644
--- a/dev/guides/rbac-for-users/readme/index.html
+++ b/dev/guides/rbac-for-users/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/storage-profiles/readme/index.html b/dev/guides/storage-profiles/readme/index.html
index b95dc1d..82fac85 100644
--- a/dev/guides/storage-profiles/readme/index.html
+++ b/dev/guides/storage-profiles/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/user-containers/readme/index.html b/dev/guides/user-containers/readme/index.html
index 5c901aa..7c6d9c8 100644
--- a/dev/guides/user-containers/readme/index.html
+++ b/dev/guides/user-containers/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/guides/user-interactions/readme/index.html b/dev/guides/user-interactions/readme/index.html
index cfd032f..9c839fd 100644
--- a/dev/guides/user-interactions/readme/index.html
+++ b/dev/guides/user-interactions/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/index.html b/dev/index.html
index 5656516..5b04811 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -16,7 +16,7 @@
-
+
diff --git a/dev/repo-guides/readme/index.html b/dev/repo-guides/readme/index.html
index 335832b..755024e 100644
--- a/dev/repo-guides/readme/index.html
+++ b/dev/repo-guides/readme/index.html
@@ -14,7 +14,7 @@
-
+
diff --git a/dev/repo-guides/release-nnf-sw/readme/index.html b/dev/repo-guides/release-nnf-sw/readme/index.html
index 9c8e379..b43cbc7 100644
--- a/dev/repo-guides/release-nnf-sw/readme/index.html
+++ b/dev/repo-guides/release-nnf-sw/readme/index.html
@@ -16,7 +16,7 @@
-
+
diff --git a/dev/rfcs/0001/readme/index.html b/dev/rfcs/0001/readme/index.html
index 781034d..35f68d8 100644
--- a/dev/rfcs/0001/readme/index.html
+++ b/dev/rfcs/0001/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/rfcs/0002/readme/index.html b/dev/rfcs/0002/readme/index.html
index 59b4a29..ea5318b 100644
--- a/dev/rfcs/0002/readme/index.html
+++ b/dev/rfcs/0002/readme/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/rfcs/index.html b/dev/rfcs/index.html
index 6a161c5..40a2ce6 100644
--- a/dev/rfcs/index.html
+++ b/dev/rfcs/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index 0252856..9f87ac3 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-,:!=\\[\\]()\"/]+|(?!\\b)(?=[A-Z][a-z])|\\.(?!\\d)|&[lg]t;","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Near Node Flash","text":"Near Node Flash, also known as Rabbit, provides a disaggregated chassis-local storage solution which utilizes SR-IOV over a PCIe Gen 4.0 switching fabric to provide a set of compute blades with NVMe storage. It also provides a dedicated storage processor to offload tasks such as storage preparation and data movement from the compute nodes.
Here you will find NNF User Guides, Examples, and Request For Comment (RFC) documents.
"},{"location":"guides/","title":"User Guides","text":""},{"location":"guides/#setup","title":"Setup","text":" - Initial Setup
- Compute Daemons
- Firmware Upgrade
- High Availability Cluster
- RBAC for Users
"},{"location":"guides/#provisioning","title":"Provisioning","text":" - Storage Profiles
- Data Movement Configuration
- Copy Offload API
- Lustre External MGT
- Global Lustre
- Directive Breakdown
- User Interactions
"},{"location":"guides/#nnf-user-containers","title":"NNF User Containers","text":" - User Containers
"},{"location":"guides/#node-management","title":"Node Management","text":" - Disable or Drain a Node
- Debugging NVMe Namespaces
"},{"location":"guides/compute-daemons/readme/","title":"Compute Daemons","text":"Rabbit software requires two daemons be installed and run on each compute node. Each daemon shares similar build, package, and installation processes described below.
- The Client Mount daemon,
clientmount
, provides the support for mounting Rabbit hosted file systems on compute nodes. - The Data Movement daemon,
nnf-dm
, supports creating, monitoring, and managing data movement (copy-offload) operations
"},{"location":"guides/compute-daemons/readme/#building-from-source","title":"Building from source","text":"Each daemon can be built in their respective repositories using the build-daemon
make target. Go version >= 1.19 must be installed to perform a local build.
"},{"location":"guides/compute-daemons/readme/#rpm-package","title":"RPM Package","text":"Each daemon is packaged as part of the build process in GitHub. Source and Binary RPMs are available.
"},{"location":"guides/compute-daemons/readme/#installation","title":"Installation","text":"For manual install, place the binary in the /usr/bin/
directory.
To install the application as a daemon service, run /usr/bin/[BINARY-NAME] install
"},{"location":"guides/compute-daemons/readme/#authentication","title":"Authentication","text":"NNF software defines a Kubernetes Service Account for granting communication privileges between the daemon and the kubeapi server. The token file and certificate file can be obtained by providing the necessary Service Account and Namespace to the below shell script.
Compute Daemon Service Account Namespace Client Mount nnf-clientmount nnf-system Data Movement nnf-dm-daemon nnf-dm-system #!/bin/bash\n\nSERVICE_ACCOUNT=$1\nNAMESPACE=$2\n\nkubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data.token' | base64 --decode > ./service.token\nkubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data[\"ca.crt\"]' | base64 --decode > ./service.cert\n
The service.token
and service.cert
files must be copied to each compute node, typically in the /etc/[BINARY-NAME]/
directory
"},{"location":"guides/compute-daemons/readme/#configuration","title":"Configuration","text":"Installing the daemon will create a default configuration located at /etc/systemd/system/[BINARY-NAME].service
The command line arguments can be provided to the service definition or as an override file.
Argument Definition --kubernetes-service-host=[ADDRESS]
The IP address or DNS entry of the kubeapi server --kubernetes-service-port=[PORT]
The listening port of the kubeapi server --service-token-file=[PATH]
Location of the service token file --service-cert-file=[PATH]
Location of the service certificate file --node-name=[COMPUTE-NODE-NAME]
Name of this compute node as described in the System Configuration. Defaults to the host name reported by the OS. --nnf-node-name=[RABBIT-NODE-NAME]
nnf-dm
daemon only. Name of the rabbit node connected to this compute node as described in the System Configuration. If not provided, the --node-name
value is used to find the associated Rabbit node in the System Configuration. --sys-config=[NAME]
nnf-dm
daemon only. The System Configuration resource's name. Defaults to default
An example unit file for nnf-dm:
cat /etc/systemd/system/nnf-dm.service[Unit]\nDescription=Near-Node Flash (NNF) Data Movement Service\n\n[Service]\nPIDFile=/var/run/nnf-dm.pid\nExecStartPre=/bin/rm -f /var/run/nnf-dm.pid\nExecStart=/usr/bin/nnf-dm \\\n --kubernetes-service-host=127.0.0.1 \\\n --kubernetes-service-port=7777 \\\n --service-token-file=/path/to/service.token \\\n --service-cert-file=/path/to/service.cert \\\n --kubernetes-qps=50 \\\n --kubernetes-burst=100\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\n
An example unit file is for clientmountd:
cat /etc/systemd/system/clientmountd.service[Unit]\nDescription=Near-Node Flash (NNF) Clientmountd Service\n\n[Service]\nPIDFile=/var/run/clientmountd.pid\nExecStartPre=/bin/rm -f /var/run/clientmountd.pid\nExecStart=/usr/bin/clientmountd \\\n --kubernetes-service-host=127.0.0.1 \\\n --kubernetes-service-port=7777 \\\n --service-token-file=/path/to/service.token \\\n --service-cert-file=/path/to/service.cert\nRestart=on-failure\nEnvironment=GOGC=off\nEnvironment=GOMEMLIMIT=20MiB\nEnvironment=GOMAXPROCS=5\nEnvironment=HTTP2_PING_TIMEOUT_SECONDS=60\n\n[Install]\nWantedBy=multi-user.target\n
"},{"location":"guides/compute-daemons/readme/#nnf-dm-specific-configuration","title":"nnf-dm Specific Configuration","text":"nnf-dm has some additional configuration options that can be used to tweak the kubernetes client:
Argument Definition --kubernetes-qps=[QPS]
The number of Queries Per Second (QPS) before client-side rate-limiting starts. Defaults to 50. --kubernetes-burst=[QPS]
Once QPS is hit, allow this many concurrent calls. Defaults to 100."},{"location":"guides/compute-daemons/readme/#easy-deployment","title":"Easy Deployment","text":"The nnf-deploy tool's install
command can be used to run the daemons on a system's set of compute nodes. This option will compile the latest daemon binaries, retrieve the service token and certificates, and will copy and install the daemons on each of the compute nodes. Refer to the nnf-deploy repository and run nnf-deploy install --help
for details.
"},{"location":"guides/data-movement/readme/","title":"Data Movement Configuration","text":"Data Movement can be configured in multiple ways:
- Server side
- Per Copy Offload API Request arguments
The first method is a \"global\" configuration - it affects all data movement operations. The second is done per the Copy Offload API, which allows for some configuration on a per-case basis, but is limited in scope. Both methods are meant to work in tandem.
"},{"location":"guides/data-movement/readme/#server-side-configmap","title":"Server Side ConfigMap","text":"The server side configuration is done via the nnf-dm-config
config map:
kubectl -n nnf-dm-system get configmap nnf-dm-config\n
The config map allows you to configure the following:
Setting Description slots The number of slots specified in the MPI hostfile. A value less than 1 disables the use of slots in the hostfile. maxSlots The number of max_slots specified in the MPI hostfile. A value less than 1 disables the use of max_slots in the hostfile. command The full command to execute data movement. More detail in the following section. progressIntervalSeconds interval to collect the progress data from the dcp
command."},{"location":"guides/data-movement/readme/#command","title":"command
","text":"The full data movement command
can be set here. By default, Data Movement uses mpirun
to run dcp
to perform the data movement. Changing the command
is useful for tweaking mpirun
or dcp
options or to replace the command with something that can aid in debugging (e.g. hostname
).
mpirun
uses hostfiles to list the hosts to launch dcp
on. This hostfile is created for each Data Movement operation, and it uses the config map to set the slots
and maxSlots
for each host (i.e. NNF node) in the hostfile. The number of slots
/maxSlots
is the same for every host in the hostfile.
Additionally, Data Movement uses substitution to fill in dynamic information for each Data Movement operation. Each of these must be present in the command for Data Movement to work properly when using mpirun
and dcp
:
VAR Description $HOSTFILE
hostfile that is created and used for mpirun. $UID
User ID that is inherited from the Workflow. $GID
Group ID that is inherited from the Workflow. $SRC
source for the data movement. $DEST
destination for the data movement. By default, the command will look something like the following. Please see the config map itself for the most up to date default command:
mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST\n
"},{"location":"guides/data-movement/readme/#profiles","title":"Profiles","text":"Profiles can be specified in the in the nnf-dm-config
config map. Users are able to select a profile using #DW directives (e.g .copy_in profile=my-dm-profile
) and the Copy Offload API. If no profile is specified, the default
profile is used. This default profile must exist in the config map.
slots
, maxSlots
, and command
can be stored in Data Movement profiles. These profiles are available to quickly switch between different settings for a particular workflow.
Example profiles:
profiles:\n default:\n slots: 8\n maxSlots: 0\n command: mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST\n no-xattrs:\n slots: 8\n maxSlots: 0\n command: mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --xattrs none --uid $UID --gid $GID $SRC $DEST\n
"},{"location":"guides/data-movement/readme/#copy-offload-api-daemon","title":"Copy Offload API Daemon","text":"The CreateRequest
API call that is used to create Data Movement with the Copy Offload API has some options to allow a user to specify some options for that particular Data Movement. These settings are on a per-request basis.
The Copy Offload API requires the nnf-dm
daemon to be running on the compute node. This daemon may be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run it only when a user requests it. See Compute Daemons for the systemd service configuration of the daemon. See RequiredDaemons
in Directive Breakdown for a description of how the user may request the daemon, in the case where the WLM will run it only on demand.
If the WLM is running the nnf-dm
daemon only on demand, then the user can request that the daemon be running for their job by specifying requires=copy-offload
in their DW
directive. The following is an example:
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload\n
See the DataMovementCreateRequest API definition for what can be configured.
"},{"location":"guides/data-movement/readme/#selinux-and-data-movement","title":"SELinux and Data Movement","text":"Careful consideration must be taken when enabling SELinux on compute nodes. Doing so will result in SELinux Extended File Attributes (xattrs) being placed on files created by applications running on the compute node, which may not be supported by the destination file system (e.g. Lustre).
Depending on the configuration of dcp
, there may be an attempt to copy these xattrs. You may need to disable this by using dcp --xattrs none
to avoid errors. For example, the command
in the nnf-dm-config
config map or dcpOptions
in the DataMovementCreateRequest API could be used to set this option.
See the dcp
documentation for more information.
"},{"location":"guides/directive-breakdown/readme/","title":"Directive Breakdown","text":""},{"location":"guides/directive-breakdown/readme/#background","title":"Background","text":"The #DW
directives in a job script are not intended to be interpreted by the workload manager. The workload manager passes the #DW
directives to the NNF software through the DWS workflow
resource, and the NNF software determines what resources are needed to satisfy the directives. The NNF software communicates this information back to the workload manager through the DWS DirectiveBreakdown
resource. This document describes how the WLM should interpret the information in the DirectiveBreakdown
.
"},{"location":"guides/directive-breakdown/readme/#directivebreakdown-overview","title":"DirectiveBreakdown Overview","text":"The DWS DirectiveBreakdown
contains all the information necessary to inform the WLM how to pick storage and compute nodes for a job. The DirectiveBreakdown
resource is created by the NNF software during the Proposal
phase of the DWS workflow. The spec
section of the DirectiveBreakdown
is filled in with the #DW
directive by the NNF software, and the status
section contains the information for the WLM. The WLM should wait until the status.ready
field is true before interpreting the rest of the status
fields.
The contents of the DirectiveBreakdown
will look different depending on the file system type and options specified by the user. The status
section contains enough information that the WLM may be able to figure out the underlying file system type requested by the user, but the WLM should not make any decisions based on the file system type. Instead, the WLM should make storage and compute allocation decisions based on the generic information provided in the DirectiveBreakdown
since the storage and compute allocations needed to satisfy a #DW
directive may differ based on options other than the file system type.
"},{"location":"guides/directive-breakdown/readme/#storage-nodes","title":"Storage Nodes","text":"The status.storage
section of the DirectiveBreakdown
describes how the storage allocations should be made and any constraints on the NNF nodes that can be picked. The status.storage
section will exist only for jobdw
and create_persistent
directives. An example of the status.storage
section is included below.
...\nspec:\n directive: '#DW jobdw capacity=1GiB type=xfs name=example'\n userID: 7900\nstatus:\n...\n ready: true\n storage:\n allocationSets:\n - allocationStrategy: AllocatePerCompute\n constraints:\n labels:\n - dataworkflowservices.github.io/storage=Rabbit\n label: xfs\n minimumCapacity: 1073741824\n lifetime: job\n reference:\n kind: Servers\n name: example-0\n namespace: default\n...\n
-
status.storage.allocationSets
is a list of storage allocation sets that are needed for the job. An allocation set is a group of individual storage allocations that all have the same parameters and requirements. Depending on the storage type specified by the user, there may be more than one allocation set. Allocation sets should be handled independently.
-
status.storage.allocationSets.allocationStrategy
specifies how the allocations should be made.
AllocatePerCompute
- One allocation is needed per compute node in the job. The size of an individual allocation is specified in status.storage.allocationSets.minimumCapacity
AllocateAcrossServers
- One or more allocations are needed with an aggregate capacity of status.storage.allocationSets.minimumCapacity
. This allocation strategy does not imply anything about how many allocations to make per NNF node or how many NNF nodes to use. The allocations on each NNF node should be the same size. AllocateSingleServer
- One allocation is needed with a capacity of status.storage.allocationSets.minimumCapacity
-
status.storage.allocationSets.constraints
is a set of requirements for which NNF nodes can be picked. More information about the different constraint types is provided in the Storage Constraints section below.
-
status.storage.allocationSets.label
is an opaque string that the WLM uses when creating the spec.allocationSets entry in the DWS Servers
resource.
-
status.storage.allocationSets.minimumCapacity
is the allocation capacity in bytes. The interpretation of this field depends on the value of status.storage.allocationSets.allocationStrategy
-
status.storage.lifetime
is used to specify how long the storage allocations will last.
job
- The allocation will last for the lifetime of the job persistent
- The allocation will last for longer than the lifetime of the job
-
status.storage.reference
is an object reference to a DWS Servers
resource where the WLM can specify allocations
"},{"location":"guides/directive-breakdown/readme/#storage-constraints","title":"Storage Constraints","text":"Constraints on an allocation set provide additional requirements for how the storage allocations should be made on NNF nodes.
-
labels
specifies a list of labels that must all be on a DWS Storage
resource in order for an allocation to exist on that Storage
.
constraints:\n labels:\n - dataworkflowservices.github.io/storage=Rabbit\n - mysite.org/pool=firmware_test\n
apiVersion: dataworkflowservices.github.io/v1alpha2\nkind: Storage\nmetadata:\n labels:\n dataworkflowservices.github.io/storage: Rabbit\n mysite.org/pool: firmware_test\n mysite.org/drive-speed: fast\n name: rabbit-node-1\n namespace: default\n ...\n
-
colocation
specifies how two or more allocations influence the location of each other. The colocation constraint has two fields, type
and key
. Currently, the only value for type
is exclusive
. key
can be any value. This constraint means that the allocations from an allocation set with the colocation constraint can't be placed on an NNF node with another allocation whose allocation set has a colocation constraint with the same key. Allocations from allocation sets with colocation constraints with different keys or allocation sets without the colocation constraint are okay to put on the same NNF node.
constraints:\n colocation:\n type: exclusive\n key: lustre-mgt\n
-
count
this field specifies the number of allocations to make when status.storage.allocationSets.allocationStrategy
is AllocateAcrossServers
constraints:\n count: 5\n
-
scale
is a unitless value from 1-10 that is meant to guide the WLM on how many allocations to make when status.storage.allocationSets.allocationStrategy
is AllocateAcrossServers
. The actual number of allocations is not meant to correspond to the value of scale. Rather, 1 would indicate the minimum number of allocations to reach status.storage.allocationSets.minimumCapacity
, and 10 would be the maximum number of allocations that make sense given the status.storage.allocationSets.minimumCapacity
and the compute node count. The NNF software does not interpret this value, and it is up to the WLM to define its meaning.
constraints:\n scale: 8\n
"},{"location":"guides/directive-breakdown/readme/#compute-nodes","title":"Compute Nodes","text":"The status.compute
section of the DirectiveBreakdown
describes how the WLM should pick compute nodes for a job. The status.compute
section will exist only for jobdw
and persistentdw
directives. An example of the status.compute
section is included below.
...\nspec:\n directive: '#DW jobdw capacity=1TiB type=lustre name=example'\n userID: 3450\nstatus:\n...\n compute:\n constraints:\n location:\n - access:\n - priority: mandatory\n type: network\n - priority: bestEffort\n type: physical\n reference:\n fieldPath: servers.spec.allocationSets[0]\n kind: Servers\n name: example-0\n namespace: default\n - access:\n - priority: mandatory\n type: network\n reference:\n fieldPath: servers.spec.allocationSets[1]\n kind: Servers\n name: example-0\n namespace: default\n...\n
The status.compute.constraints
section lists any constraints on which compute nodes can be used. Currently the only constraint type is the location
constraint. status.compute.constraints.location
is a list of location constraints that all must be satisfied.
A location constraint consists of an access
list and a reference
.
status.compute.constraints.location.reference
is an object reference with a fieldPath
that points to an allocation set in the Servers
resource. If this is from a #DW jobdw
directive, the Servers
resource won't be filled in until the WLM picks storage nodes for the allocations. status.compute.constraints.location.access
is a list that specifies what type of access the compute nodes need to have to the storage allocations in the allocation set. An allocation set may have multiple access types that are required status.compute.constraints.location.access.type
specifies the connection type for the storage. This can be network
or physical
status.compute.constraints.location.access.priority
specifies how necessary the connection type is. This can be mandatory
or bestEffort
"},{"location":"guides/directive-breakdown/readme/#requireddaemons","title":"RequiredDaemons","text":"The status.requiredDaemons
section of the DirectiveBreakdown
tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The status.requiredDaemons
section will exist only for jobdw
and persistentdw
directives. An example of the status.requiredDaemons
section is included below.
status:\n...\n requiredDaemons:\n - copy-offload\n...\n
The allowed list of required daemons that may be specified is defined in the nnf-ruleset.yaml for DWS, found in the nnf-sos
repository. The ruleDefs.key[requires]
statement is specified in two places in the ruleset, one for jobdw
and the second for persistentdw
. The ruleset allows a list of patterns to be specified, allowing one for each of the allowed daemons.
The DW
directive will include a comma-separated list of daemons after the requires
keyword. The following is an example:
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload\n
The DWDirectiveRule
resource currently active on the system can be viewed with:
kubectl get -n dws-system dwdirectiverule nnf -o yaml\n
"},{"location":"guides/directive-breakdown/readme/#valid-daemons","title":"Valid Daemons","text":"Each site should define the list of daemons that are valid for that site and recognized by that site's WLM. The initial nnf-ruleset.yaml
defines only one, called copy-offload
. When a user specifies copy-offload
in their DW
directive, they are stating that their compute-node application will use the Copy Offload API Daemon described in the Data Movement Configuration.
"},{"location":"guides/external-mgs/readme/","title":"Lustre External MGT","text":""},{"location":"guides/external-mgs/readme/#background","title":"Background","text":"Lustre has a limitation where only a single MGT can be mounted on a node at a time. In some situations it may be desirable to share an MGT between multiple Lustre file systems to increase the number of Lustre file systems that can be created and to decrease scheduling complexity. This guide provides instructions on how to configure NNF to share MGTs. There are three methods that can be used:
- Use a Lustre MGT from outside the NNF cluster
- Create a persistent Lustre file system through DWS and use the MGT it provides
- Create a pool of standalone persistent Lustre MGTs, and have the NNF software select one of them
These three methods are not mutually exclusive on the system as a whole. Individual file systems can use any of options 1-3 or create their own MGT.
"},{"location":"guides/external-mgs/readme/#configuration-with-an-external-mgt","title":"Configuration with an External MGT","text":"An existing MGT external to the NNF cluster can be used to manage the Lustre file systems on the NNF nodes. An advantage to this configuration is that the MGT can be highly available through multiple MGSs. A disadvantage is that there is only a single MGT. An MGT shared between more than a handful of Lustre file systems is not a common use case, so the Lustre code may prove less stable.
The following yaml provides an example of what the NnfStorageProfile
should contain to use an MGT on an external server.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: external-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: 1.2.3.4@eth0\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
"},{"location":"guides/external-mgs/readme/#configuration-with-persistent-lustre","title":"Configuration with Persistent Lustre","text":"The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster.
To configure a persistent Lustre file system that can share its MGT, a NnfStorageProfile
should be used that does not specify externalMgs
. The MGT can either share a volume with the MDT or not (combinedMgtMdt
).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: persistent-lustre-shared-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
The persistent storage is created with the following DW directive:
#DW create_persistent name=shared-lustre capacity=100GiB type=lustre profile=persistent-lustre-shared-mgt\n
After the persistent Lustre file system is created, an admin can discover the MGS address by looking at the NnfStorage
resource with the same name as the persistent storage that was created (shared-lustre
in the above example).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorage\nmetadata:\n name: shared-lustre\n namespace: default\n[...]\nstatus:\n mgsNode: 5.6.7.8@eth1\n[...]\n
A separate NnfStorageProfile
can be created that specifies the MGS address.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: internal-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: 5.6.7.8@eth1\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
With this configuration, an admin must determine that no file systems are using the shared MGT before destroying the persistent Lustre instance.
"},{"location":"guides/external-mgs/readme/#configuration-with-an-internal-mgt-pool","title":"Configuration with an Internal MGT Pool","text":"Another method NNF supports is to create a number of persistent Lustre MGTs on NNF nodes. These MGTs are not part of a full file system, but are instead added to a pool of MGTs available for other Lustre file systems to use. Lustre file systems that are created will choose one of the MGTs at random to use and add a reference to make sure it isn't destroyed. This configuration has the advantage of spreading the Lustre management load across multiple servers. The disadvantage of this configuration is that it does not provide high availability.
To configure the system this way, the first step is to make a pool of Lustre MGTs. This is done by creating a persistent instance from a storage profile that specifies the standaloneMgtPoolName
option. This option tells NNF software to only create an MGT, and to add it to a named pool. The following NnfStorageProfile
provides an example where the MGT is added to the example-pool
pool:
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: mgt-pool-member\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"example-pool\"\n[...]\n
A persistent storage MGTs can be created with the following DW directive:
#DW create_persistent name=mgt-pool-member-1 capacity=1GiB type=lustre profile=mgt-pool-member\n
Multiple persistent instances with different names can be created using the mgt-pool-member
profile to add more than one MGT to the pool.
To create a Lustre file system that uses one of the MGTs from the pool, an NnfStorageProfile
should be created that uses the special notation pool:[pool-name]
in the externalMgs
field.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: mgt-pool-consumer\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"pool:example-pool\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
The following provides an example DW directive that uses an MGT from the MGT pool:
#DW jobdw name=example-lustre capacity=100GiB type=lustre profile=mgt-pool-consumer\n
MGT pools are named, so there can be separate pools with collections of different MGTs in them. A storage profile targeting each pool would be needed.
"},{"location":"guides/firmware-upgrade/readme/","title":"Firmware Upgrade Procedures","text":"This guide presents the firmware upgrade procedures to upgrade firmware from the Rabbit using tools present in the operating system.
"},{"location":"guides/firmware-upgrade/readme/#pcie-switch-firmware-upgrade","title":"PCIe Switch Firmware Upgrade","text":"In order to upgrade the firmware on the PCIe switch, the switchtec
kernel driver and utility of the same name must be installed. Rabbit hardware consists of two PCIe switches, which can be managed by devices typically located at /dev/switchtec0
and /dev/switchtec1
.
Danger
Upgrading the switch firmware will cause the switch to reset. Prototype Rabbit units not supporting hotplug should undergo a power-cycle to ensure switch initialization following firmware uprade. Similarily, compute nodes not supporting hotplug may lose connectivity after firmware upgrade and should also be power-cycled.
IMAGE=$1 # Provide the path to the firmware image file\nSWITCHES=(\"/dev/switchtec0\" \"/dev/switchtec1\")\nfor SWITCH in \"${SWITCHES[@]}\"; do switchtec fw-update \"$SWITCH\" \"$IMAGE\" --yes; done\n
"},{"location":"guides/firmware-upgrade/readme/#nvme-drive-firmware-upgrade","title":"NVMe Drive Firmware Upgrade","text":"In order to upgrade the firmware on NVMe drives attached to Rabbit, the switchtec
and switchtec-nvme
executables must be installed. All firmware downloads to drives are sent to the physical function of the drive which is accessible only using the switchtec-nvme
executable.
"},{"location":"guides/firmware-upgrade/readme/#batch-method","title":"Batch Method","text":""},{"location":"guides/firmware-upgrade/readme/#download-and-commit-new-firmware","title":"Download and Commit New Firmware","text":"The nvme.sh helper script applies the same command to each physical device fabric ID in the system. It provides a convenient way to upgrade the firmware on all drives in the system. Please see fw-download and fw-commit for details about the individual commands.
# Download firmware to all drives\n./nvme.sh cmd fw-download --fw=</path/to/nvme.fw>\n\n# Commit the new firmware\n# action=3: The image is requested to be activated immediately\n./nvme.sh cmd fw-commit --action=3\n
"},{"location":"guides/firmware-upgrade/readme/#rebind-the-pcie-connections","title":"Rebind the PCIe Connections","text":"In order to use the drives at this point, they must be unbound and bound to the PCIe fabric to reset device connections. The bind.sh helper script performs these two actions. Its use is illustrated below.
# Unbind all drives from the Rabbit to disconnect the PCIe connection to the drives\n./bind.sh unbind\n\n# Bind all drives to the Rabbit to reconnect the PCIe bus\n./bind.sh bind\n\n# At this point, your drives should be running the new firmware.\n# Verify the firmware...\n./nvme.sh cmd id-ctrl | grep -E \"^fr \"\n
"},{"location":"guides/firmware-upgrade/readme/#individual-drive-method","title":"Individual Drive Method","text":""},{"location":"guides/firmware-upgrade/readme/#determine-physical-device-fabric-id","title":"Determine Physical Device Fabric ID","text":"The first step is to determine a drive's unique Physical Device Fabric Identifier (PDFID). The following code fragment demonstrates one way to list the physcial device fabric ids of all the NVMe drives in the system.
#!/bin/bash\n\nSWITCHES=(\"/dev/switchtec0\" \"/dev/switchtec1\")\nfor SWITCH in \"${SWITCHES[@]}\";\ndo\n mapfile -t PDFIDS < <(sudo switchtec fabric gfms-dump \"${SWITCH}\" | grep \"Function 0 \" -A1 | grep PDFID | awk '{print $2}')\n for INDEX in \"${!PDFIDS[@]}\";\n do\n echo \"${PDFIDS[$INDEX]}@$SWITCH\"\n done\ndone\n
# Produces a list like this:\n0x1300@/dev/switchtec0\n0x1600@/dev/switchtec0\n0x1700@/dev/switchtec0\n0x1400@/dev/switchtec0\n0x1800@/dev/switchtec0\n0x1900@/dev/switchtec0\n0x1500@/dev/switchtec0\n0x1a00@/dev/switchtec0\n0x4100@/dev/switchtec1\n0x3c00@/dev/switchtec1\n0x4000@/dev/switchtec1\n0x3e00@/dev/switchtec1\n0x4200@/dev/switchtec1\n0x3b00@/dev/switchtec1\n0x3d00@/dev/switchtec1\n0x3f00@/dev/switchtec1\n
"},{"location":"guides/firmware-upgrade/readme/#download-firmware","title":"Download Firmware","text":"Using the physical device fabric identifier, the following commands update the firmware for specified drive.
# Download firmware to the drive\nsudo switchtec-nvme fw-download <PhysicalDeviceFabricID> --fw=</path/to/nvme.fw>\n\n# Activate the new firmware\n# action=3: The image is requested to be activated immediately without reset.\nsudo switchtec-nvme fw-commit --action=3\n
"},{"location":"guides/firmware-upgrade/readme/#rebind-pcie-connection","title":"Rebind PCIe Connection","text":"Once the firmware has been downloaded and committed, the PCIe connection from the Rabbit to the drive must be unbound and rebound. Please see bind.sh for details.
"},{"location":"guides/global-lustre/readme/","title":"Global Lustre","text":""},{"location":"guides/global-lustre/readme/#background","title":"Background","text":"Adding global lustre to rabbit systems allows access to external file systems. This is primarily used for Data Movement, where a user can perform copy_in
and copy_out
directives with global lustre being the source and destination, respectively.
Global lustre fileystems are represented by the lustrefilesystems
resource in Kubernetes:
$ kubectl get lustrefilesystems -A\nNAMESPACE NAME FSNAME MGSNIDS AGE\ndefault mylustre mylustre 10.1.1.113@tcp 20d\n
An example resource is as follows:
apiVersion: lus.cray.hpe.com/v1beta1\nkind: LustreFileSystem\nmetadata:\n name: mylustre\n namespace: default\nspec:\n mgsNids: 10.1.1.100@tcp\n mountRoot: /p/mylustre\n name: mylustre\n namespaces:\n default:\n modes:\n - ReadWriteMany\n
"},{"location":"guides/global-lustre/readme/#namespaces","title":"Namespaces","text":"Note the spec.namespaces
field. For each namespace listed, the lustre-fs-operator
creates a PV/PVC pair in that namespace. This allows pods in that namespace to access global lustre. The default
namespace should appear in this list. This makes the lustrefilesystem
resource available to the default
namespace, which makes it available to containers (e.g. container workflows) running in the default
namespace.
The nnf-dm-system
namespace is added automatically - no need to specify that manually here. The NNF Data Movement Manager is responsible for ensuring that the nnf-dm-system
is in spec.namespaces
. This is to ensure that the NNF DM Worker pods have global lustre mounted as long as nnf-dm
is deployed. To unmount global lustre from the NNF DM Worker pods, the lustrefilesystem
resource must be deleted.
The lustrefilesystem
resource itself should be created in the default
namespace (i.e. metadata.namespace
).
"},{"location":"guides/global-lustre/readme/#nnf-data-movement-manager","title":"NNF Data Movement Manager","text":"The NNF Data Movement Manager is responsible for monitoring lustrefilesystem
resources to mount (or umount) the global lustre filesystem in each of the NNF DM Worker pods. These pods run on each of the NNF nodes. This means with each addition or removal of lustrefilesystems
resources, the DM worker pods restart to adjust their mount points.
The NNF Data Movement Manager also places a finalizer on the lustrefilesystem
resource to indicate that the resource is in use by Data Movement. This is to prevent the PV/PVC being deleted while they are being used by pods.
"},{"location":"guides/global-lustre/readme/#adding-global-lustre","title":"Adding Global Lustre","text":"As mentioned previously, the NNF Data Movement Manager monitors these resources and automatically adds the nnf-dm-system
namespace to all lustrefilesystem
resources. Once this happens, a PV/PVC is created for the nnf-dm-system
namespace to access global lustre. The Manager updates the NNF DM Worker pods, which are then restarted to mount the global lustre file system.
"},{"location":"guides/global-lustre/readme/#removing-global-lustre","title":"Removing Global Lustre","text":"When a lustrefilesystem
is deleted, the NNF DM Manager takes notice and starts to unmount the file system from the DM Worker pods - causing another restart of the DM Worker pods. Once this is finished, the DM finalizer is removed from the lustrefilesystem
resource to signal that it is no longer in use by Data Movement.
If a lustrefilesystem
does not delete, check the finalizers to see what might still be using it. It is possible to get into a situation where nnf-dm
has been undeployed, so there is nothing to remove the DM finalizer from the lustrefilesystem
resource. If that is the case, then manually remove the DM finalizer so the deletion of the lustrefilesystem
resource can continue.
"},{"location":"guides/ha-cluster/notes/","title":"Notes","text":"pcs stonith create stonith-rabbit-node-1 fence_nnf pcmk_host_list=rabbit-node-1 kubernetes-service-host=10.30.107.247 kubernetes-service-port=6443 service-token-file=/etc/nnf/service.token service-cert-file=/etc/nnf/service.cert nnf-node-name=rabbit-node-1 verbose=1
pcs stonith create stonith-rabbit-compute-2 fence_redfish pcmk_host_list=\"rabbit-compute-2\" ip=10.30.105.237 port=80 systems-uri=/redfish/v1/Systems/1 username=root password=REDACTED ssl_insecure=true verbose=1
pcs stonith create stonith-rabbit-compute-3 fence_redfish pcmk_host_list=\"rabbit-compute-3\" ip=10.30.105.253 port=80 systems-uri=/redfish/v1/Systems/1 username=root password=REDACTED ssl_insecure=true verbose=1
"},{"location":"guides/ha-cluster/readme/","title":"High Availability Cluster","text":"NNF software supports provisioning of Red Hat GFS2 (Global File System 2) storage. Per RedHat:
GFS2 allows multiple nodes to share storage at a block level as if the storage were connected locally to each cluster node. GFS2 cluster file system requires a cluster infrastructure.
Therefore, in order to use GFS2, the NNF node and its associated compute nodes must form a high availability cluster.
"},{"location":"guides/ha-cluster/readme/#cluster-setup","title":"Cluster Setup","text":"Red Hat provides instructions for creating a high availability cluster with Pacemaker, including instructions for installing cluster software and creating a high availability cluster. When following these instructions, each of the high availability clusters that are created should be named after the hostname of the NNF node. In the Red Hat examples the cluster name is my_cluster
.
"},{"location":"guides/ha-cluster/readme/#fencing-agents","title":"Fencing Agents","text":"Fencing is the process of restricting and releasing access to resources that a failed cluster node may have access to. Since a failed node may be unresponsive, an external device must exist that can restrict access to shared resources of that node, or to issue a hard reboot of the node. More information can be found form Red Hat: 1.2.1 Fencing.
HPE hardware implements software known as the Hardware System Supervisor (HSS), which itself conforms to the SNIA Redfish/Swordfish standard. This provides the means to manage hardware outside the host OS.
"},{"location":"guides/ha-cluster/readme/#nnf-fencing","title":"NNF Fencing","text":""},{"location":"guides/ha-cluster/readme/#source","title":"Source","text":"The NNF Fencing agent is available at https://github.com/NearNodeFlash/fence-agents under the nnf
branch.
git clone https://github.com/NearNodeFlash/fence-agents --branch nnf\n
"},{"location":"guides/ha-cluster/readme/#build","title":"Build","text":"Refer to the NNF.md file
in the root directory of the fence-agents repository. The fencing agents must be installed on every node in the cluster.
"},{"location":"guides/ha-cluster/readme/#setup","title":"Setup","text":"Configure the NNF agent with the following parameters:
Argument Definition kubernetes-service-host=[ADDRESS]
The IP address of the kubeapi server kubernetes-service-port=[PORT]
The listening port of the kubeapi server service-token-file=[PATH]
The location of the service token file. The file must be present on all nodes within the cluster service-cert-file=[PATH]
The location of the service certificate file. The file must be present on all nodes within the cluster nnf-node-name=[NNF-NODE-NAME]
Name of the NNF node as it appears in the System Configuration api-version=[VERSION]
The API Version of the NNF Node resource. Defaults to \"v1alpha1\" The token and certificate can be found in the Kubernetes Secrets resource for the nnf-system/nnf-fencing-agent ServiceAccount. This provides RBAC rules to limit the fencing agent to only the Kubernetes resources it needs access to.
The following example sets up the NNF fencing agent on rabbit-node-1
with the Kubernetes API server running at 192.168.0.1:6443
and the service token and certificate copied to /etc/nnf/fence/
. Run this command on one node in the cluster.
pcs stonith create rabbit-node-1 fence_nnf pcmk_host_list=rabbit-node-1 kubernetes-service-host=192.168.0.1 kubernetes-service-port=6443 service-token-file=/etc/nnf/fence/service.token service-cert-file=/etc/nnf/fence/service.cert nnf-node-name=rabbit-node-1\n
"},{"location":"guides/ha-cluster/readme/#recovery","title":"Recovery","text":"Since the NNF node is connected to 16 compute blades, careful coordination around fencing of a NNF node is required to minimize the impact of the outage. When a Rabbit node is fenced, the corresponding DWS Storage resource (storages.dws.cray.hpe.com
) status changes. The workload manager must observe this change and follow the procedure below to recover from the fencing status.
- Observe that the
storage.Status
changed and that storage.Status.RequiresReboot == True
- Set the
storage.Spec.State := Disabled
- Wait for the Storage status to report
storage.Status.State == Disabled
- Reboot the NNF node
- Set the
storage.Spec.State := Enabled
- Wait for
storage.Status.State == Enabled
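A minimal command sketch of this procedure, assuming the Storage resource is named after the NNF node (rabbit-node-1 here) and that the status fields appear in JSON as requiresReboot and state (both assumptions; adjust to the actual resource):
# 1. confirm the fenced node reports that it needs a reboot\nkubectl get storage rabbit-node-1 -o jsonpath='{.status.requiresReboot}'\n\n# 2. disable the node, then wait for the status to report Disabled\nkubectl patch storage rabbit-node-1 --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Disabled\"}]'\nkubectl get storage rabbit-node-1 -o jsonpath='{.status.state}'\n\n# 3. reboot the NNF node, then re-enable it and wait for Enabled\nkubectl patch storage rabbit-node-1 --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Enabled\"}]'\nkubectl get storage rabbit-node-1 -o jsonpath='{.status.state}'\n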
"},{"location":"guides/ha-cluster/readme/#compute-fencing","title":"Compute Fencing","text":"The Redfish fencing agent from ClusterLabs should be used for Compute nodes in the cluster. It is also included at https://github.com/NearNodeFlash/fence-agents, and can be built at the same time as the NNF fencing agent. Configure the agent with the following parameters:
Argument Definition ip=[ADDRESS]
The IP address or hostname of the HSS controller port=80
The Port of the HSS controller. Must be 80
systems-uri=/redfish/v1/Systems/1
The URI of the Systems object. Must be /redfish/v1/Systems/1
ssl_insecure=true
Instructs the use of an insecure SSL exchange. Must be true
username=[USER]
The user name for connecting to the HSS controller password=[PASSWORD]
The password for connecting to the HSS controller The following example sets up the Redfish fencing agent on rabbit-compute-2
with the Redfish service at 192.168.0.1
. Run this command on one node in the cluster.
pcs stonith create rabbit-compute-2 fence_redfish pcmk_host_list=rabbit-compute-2 ip=192.168.0.1 systems-uri=/redfish/v1/Systems/1 username=root password=password ssl_insecure=true\n
"},{"location":"guides/ha-cluster/readme/#dummy-fencing","title":"Dummy Fencing","text":"The dummy fencing agent from ClusterLabs can be used for nodes in the cluster for an early access development system.
"},{"location":"guides/ha-cluster/readme/#configuring-a-gfs2-file-system-in-a-cluster","title":"Configuring a GFS2 file system in a cluster","text":"Follow steps 1-8 of the procedure from Red Hat: Configuring a GFS2 file system in a cluster.
"},{"location":"guides/initial-setup/readme/","title":"Initial Setup Instructions","text":"Instructions for the initial setup of a Rabbit are included in this document.
"},{"location":"guides/initial-setup/readme/#lvm-configuration-on-rabbit","title":"LVM Configuration on Rabbit","text":"LVM Details Running LVM commands (lvcreate/lvremove) on a Rabbit to create logical volumes is problematic if those commands run within a container. Rabbit Storage Orchestration code contained in the nnf-node-manager
Kubernetes pod executes LVM commands from within the container. The problem is that the LVM create/remove commands wait for a UDEV confirmation cookie that is set when UDEV rules run within the host OS. These cookies are not synchronized with the containers where the LVM commands execute.
Three options to solve this problem are:
- Disable UDEV sync at the host operating system level
- Disable UDEV sync using the
--noudevsync
command option for each LVM command - Clear the UDEV cookie using the
dmsetup udevcomplete_all
command after the lvcreate/lvremove command.
Taking these in reverse order: option 3 allows the UDEV settings within the host OS to remain unchanged from the default, but one would need to run the dmsetup
command on a separate thread because the LVM create/remove command waits for the UDEV cookie. This opens too many error paths, so it was rejected.
Option 2 also allows the UDEV settings within the host OS to remain unchanged from the default, but the use of UDEV within production Rabbit systems is viewed as unnecessary because the host OS is PXE-booted onto the node rather than loaded from a device that is discovered by UDEV.
Option 1 above is what we chose to implement because it is the simplest. The following sections discuss this setting.
In order for LVM commands to run within the container environment on a Rabbit, the following change is required to the /etc/lvm/lvm.conf
file on Rabbit.
sed -i 's/udev_sync = 1/udev_sync = 0/g' /etc/lvm/lvm.conf\n
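To confirm the change took effect (a simple check; the exact whitespace in lvm.conf may differ):
# expect: udev_sync = 0\ngrep udev_sync /etc/lvm/lvm.conf\n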
"},{"location":"guides/initial-setup/readme/#zfs","title":"ZFS","text":"ZFS kernel module must be enabled to run on boot. This can be done by creating a file, zfs.conf
, containing the string \"zfs\" in your systems modules-load.d directory.
echo \"zfs\" > /etc/modules-load.d/zfs.conf\n
"},{"location":"guides/initial-setup/readme/#kubernetes-initial-setup","title":"Kubernetes Initial Setup","text":"Installation of Kubernetes (k8s) nodes proceeds by installing k8s components onto the master node(s) of the cluster, then installing k8s components onto the worker nodes and joining those workers to the cluster. The k8s cluster setup for Rabbit requires 3 distinct k8s node types for operation:
- Master: 1 or more master nodes which serve as the Kubernetes API server and control access to the system. For HA, at least 3 nodes should be dedicated to this role.
- Worker: 1 or more worker nodes which run the system level controller manager (SLCM) and Data Workflow Services (DWS) pods. In production, at least 3 nodes should be dedicated to this role.
- Rabbit: 1 or more Rabbit nodes which run the node level controller manager (NLCM) code. The NLCM daemonset pods are exclusively scheduled on Rabbit nodes. All Rabbit nodes are joined to the cluster as k8s workers, and they are tainted to restrict the type of work that may be scheduled on them. The NLCM pod has a toleration that allows it to run on the tainted (i.e. Rabbit) nodes.
"},{"location":"guides/initial-setup/readme/#kubernetes-node-labels","title":"Kubernetes Node Labels","text":"Node Type Node Label Generic Kubernetes Worker Node cray.nnf.manager=true Rabbit Node cray.nnf.node=true"},{"location":"guides/initial-setup/readme/#kubernetes-node-taints","title":"Kubernetes Node Taints","text":"Node Type Node Label Rabbit Node cray.nnf.node=true:NoSchedule See Taints and Tolerations. The SystemConfiguration controller will handle node taints and labels for the rabbit nodes based on the contents of the SystemConfiguration resource described below.
"},{"location":"guides/initial-setup/readme/#rabbit-system-configuration","title":"Rabbit System Configuration","text":"The SystemConfiguration Custom Resource Definition (CRD) is a DWS resource that describes the hardware layout of the whole system. It is expected that an administrator creates a single SystemConfiguration resource when the system is being set up. There is no need to update the SystemConfiguration resource unless hardware is added to or removed from the system.
System Configuration Details Rabbit software looks for a SystemConfiguration named default
in the default
namespace. This resource contains a list of compute nodes and storage nodes, and it describes the mapping between them. There are two different consumers of the SystemConfiguration resource in the NNF software:
NnfNodeReconciler
- The reconciler for the NnfNode resource running on the Rabbit nodes reads the SystemConfiguration resource. It uses the storage-to-compute mapping information to fill in the HostName section of the NnfNode resource. This information is then used to populate the DWS Storage resource.
NnfSystemConfigurationReconciler
- This reconciler runs in the nnf-controller-manager
. It creates a Namespace for each compute node listed in the SystemConfiguration. These namespaces are used by the client mount code.
Here is an example SystemConfiguration
:
Spec Section Notes computeNodes List of names of compute nodes in the system storageNodes List of Rabbits and the compute nodes attached storageNodes[].type Must be \"Rabbit\" storageNodes[].computeAccess List of {slot, compute name} elements that indicate physical slot index that the named compute node is attached to apiVersion: dataworkflowservices.github.io/v1alpha2\nkind: SystemConfiguration\nmetadata:\n name: default\n namespace: default\nspec:\n computeNodes:\n - name: compute-01\n - name: compute-02\n - name: compute-03\n - name: compute-04\n ports:\n - 5000-5999\n portsCooldownInSeconds: 0\n storageNodes:\n - computesAccess:\n - index: 0\n name: compute-01\n - index: 1\n name: compute-02\n - index: 6\n name: compute-03\n name: rabbit-name-01\n type: Rabbit\n - computesAccess:\n - index: 4\n name: compute-04\n name: rabbit-name-02\n type: Rabbit\n
"},{"location":"guides/node-management/drain/","title":"Disable Or Drain A Node","text":""},{"location":"guides/node-management/drain/#disabling-a-node","title":"Disabling a node","text":"A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM.
Disable a node by setting its Storage state to Disabled
.
kubectl patch storage $NODE --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Disabled\"}]'\n
When the Storage is queried by the WLM, it will show the disabled status.
$ kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Ready Live 10m\nkind-worker3 Disabled Disabled Live 10m\n
To re-enable a node, set its Storage state to Enabled
.
kubectl patch storage $NODE --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Enabled\"}]'\n
The Storage state will show that it is enabled.
kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Ready Live 10m\nkind-worker3 Enabled Ready Live 10m\n
"},{"location":"guides/node-management/drain/#draining-a-node","title":"Draining a node","text":"The NNF software consists of a collection of DaemonSets and Deployments. The pods on the Rabbit nodes are usually from DaemonSets. Because of this, the kubectl drain
command is not able to remove the NNF software from a node. See Safely Drain a Node for details about the limitations posed by DaemonSet pods.
Given the limitations of DaemonSets, the NNF software will be drained by using taints, as described in Taints and Tolerations.
This would be used only after the WLM jobs have been removed from that Rabbit (preferably) and there is some reason to also remove the NNF software from it. This might be used before a Rabbit is powered off and pulled out of the cabinet, for example, to avoid leaving pods in \"Terminating\" state (harmless, but it's noise).
Applying this taint before power-off means there won't be \"Terminating\" pods lying around for that Rabbit. After a new (or the same) Rabbit is put back in its place, the NNF software won't jump back on it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new (or the same) Rabbit is powered back on.
"},{"location":"guides/node-management/drain/#drain-nnf-pods-from-a-rabbit-node","title":"Drain NNF pods from a rabbit node","text":"Drain the NNF software from a node by applying the cray.nnf.node.drain
taint. The CSI driver pods will remain on the node to satisfy any unmount requests from k8s as it cleans up the NNF pods.
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute\n
This will cause the node's Storage
resource to be drained:
$ kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Drained Live 5m44s\nkind-worker3 Enabled Ready Live 5m45s\n
The Storage
resource will contain the following message indicating the reason it has been drained:
$ kubectl get storages rabbit1 -o json | jq -rM .status.message\nKubernetes node is tainted with cray.nnf.node.drain\n
To restore the node to service, remove the cray.nnf.node.drain
taint.
kubectl taint node $NODE cray.nnf.node.drain-\n
The Storage
resource will revert to a Ready
status.
"},{"location":"guides/node-management/drain/#the-csi-driver","title":"The CSI driver","text":"While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so.
Warning K8s relies on the CSI driver to unmount any filesystems that may have been mounted into a pod's namespace. If it is not present when k8s is attempting to remove a pod then the pod may be left in \"Terminating\" state. This is most obvious when draining the nnf-dm-worker
pods which usually have filesystems mounted in them.
Drain the CSI driver pod from a node by applying the cray.nnf.node.drain.csi
taint.
kubectl taint node $NODE cray.nnf.node.drain.csi=true:NoSchedule cray.nnf.node.drain.csi=true:NoExecute\n
To restore the CSI driver pods to that node, remove the cray.nnf.node.drain.csi
taint.
kubectl taint node $NODE cray.nnf.node.drain.csi-\n
This taint will also drain the remaining NNF software if it has not already been drained by the cray.nnf.node.drain
taint.
"},{"location":"guides/node-management/nvme-namespaces/","title":"Debugging NVMe Namespaces","text":""},{"location":"guides/node-management/nvme-namespaces/#total-space-available-or-used","title":"Total Space Available or Used","text":"Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the nnf-node-manager
pod on that node.
To view the space on node ee50, find its nnf-node-manager
pod and then exec into it to query the Redfish API:
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager\nnnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>\n
Then query the Redfish API to view the AllocatedBytes
and GuaranteedBytes
:
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq\n{\n \"@odata.id\": \"/redfish/v1/StorageServices/NNF/CapacitySource\",\n \"@odata.type\": \"#CapacitySource.v1_0_0.CapacitySource\",\n \"Id\": \"0\",\n \"Name\": \"Capacity Source\",\n \"ProvidedCapacity\": {\n \"Data\": {\n \"AllocatedBytes\": 128849888,\n \"ConsumedBytes\": 128849888,\n \"GuaranteedBytes\": 307132496928,\n \"ProvisionedBytes\": 307261342816\n },\n \"Metadata\": {},\n \"Snapshot\": {}\n },\n \"ProvidedClassOfService\": {},\n \"ProvidingDrives\": {},\n \"ProvidingPools\": {},\n \"ProvidingVolumes\": {},\n \"Actions\": {},\n \"ProvidingMemory\": {},\n \"ProvidingMemoryChunks\": {}\n}\n
"},{"location":"guides/node-management/nvme-namespaces/#total-orphaned-or-leaked-space","title":"Total Orphaned or Leaked Space","text":"To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations then there should be no NnfNodeBlockStorages
in the k8s namespace with the Rabbit's name:
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50\nNo resources found in ee50 namespace.\n
To check that there are no orphaned namespaces, you can use the nvme command while logged into that Rabbit node:
[root@ee50:~]# nvme list\nNode SN Model Namespace Usage Format FW Rev\n--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------\n/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q\n
There should be no namespaces on the kioxia drives:
[root@ee50:~]# nvme list | grep -i kioxia\n[root@ee50:~]#\n
If there are namespaces listed, and there weren't any NnfNodeBlockStorages
on the node, then they need to be deleted through the Rabbit software. The NnfNodeECData
resource is a persistent data store for the allocations that should exist on the Rabbit. By deleting it, and then deleting the nnf-node-manager pod, it causes nnf-node-manager to delete the orphaned namespaces. This can take a few minutes after you actually delete the pod:
kubectl delete nnfnodeecdata ec-data -n ee50\nkubectl delete pod -n nnf-system nnf-node-manager-jhglm\n
"},{"location":"guides/rbac-for-users/readme/","title":"RBAC: Role-Based Access Control","text":"RBAC (Role Based Access Control) determines the operations a user or service can perform on a list of Kubernetes resources. RBAC affects everything that interacts with the kube-apiserver (both users and services internal or external to the cluster). More information about RBAC can be found in the Kubernetes documentation.
"},{"location":"guides/rbac-for-users/readme/#rbac-for-users","title":"RBAC for Users","text":"This section shows how to create a kubeconfig file with RBAC set up to restrict access to view only for resources.
"},{"location":"guides/rbac-for-users/readme/#overview","title":"Overview","text":"User access to a Kubernetes cluster is defined through a kubeconfig file. This file contains the address of the kube-apiserver as well as the key and certificate for the user. Typically this file is located in ~/.kube/config
. When a kubernetes cluster is created, a config file is generated for the admin that allows unrestricted access to all resources in the cluster. This is the equivalent of root
on a Linux system.
The goal of this document is to create a new kubeconfig file that allows view only access to Kubernetes resources. This kubeconfig file can be shared between the HPE employees to investigate issues on the system. This involves:
- Generating a new key/cert pair for an \"hpe\" user
- Creating a new kubeconfig file
- Adding RBAC rules for the \"hpe\" user to allow read access
"},{"location":"guides/rbac-for-users/readme/#generate-a-key-and-certificate","title":"Generate a Key and Certificate","text":"The first step is to create a new key and certificate so that HPE employees can authenticate as the \"hpe\" user. This will likely be done on one of the master nodes. The openssl
command needs access to the certificate authority file. This is typically located in /etc/kubernetes/pki
.
# make a temporary work space\nmkdir /tmp/rabbit\ncd /tmp/rabbit\n\n# Create this user\nexport USERNAME=hpe\n\n# generate a new key\nopenssl genrsa -out rabbit.key 2048\n\n# create a certificate signing request for this user\nopenssl req -new -key rabbit.key -out rabbit.csr -subj \"/CN=$USERNAME\"\n\n# generate a certificate using the certificate authority on the k8s cluster. This certificate lasts 500 days\nopenssl x509 -req -in rabbit.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out rabbit.crt -days 500\n
"},{"location":"guides/rbac-for-users/readme/#create-a-kubeconfig","title":"Create a kubeconfig","text":"After the keys have been generated, a new kubeconfig file can be created for this user. The admin kubeconfig /etc/kubernetes/admin.conf
can be used to determine the cluster name kube-apiserver address.
# create a new kubeconfig with the server information\nkubectl config set-cluster $CLUSTER_NAME --kubeconfig=/tmp/rabbit/rabbit.conf --server=$SERVER_ADDRESS --certificate-authority=/etc/kubernetes/pki/ca.crt --embed-certs=true\n\n# add the key and cert for this user to the config\nkubectl config set-credentials $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --client-certificate=/tmp/rabbit/rabbit.crt --client-key=/tmp/rabbit/rabbit.key --embed-certs=true\n\n# add a context\nkubectl config set-context $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --cluster=$CLUSTER_NAME --user=$USERNAME\n
The kubeconfig file should be placed in a location where HPE employees have read access to it.
"},{"location":"guides/rbac-for-users/readme/#create-clusterrole-and-clusterrolebinding","title":"Create ClusterRole and ClusterRoleBinding","text":"The next step is to create ClusterRole and ClusterRoleBinding resources. The ClusterRole provided allows viewing all cluster and namespace scoped resources, but disallows creating, deleting, or modifying any resources.
ClusterRole
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRole\nmetadata:\n name: hpe-viewer\nrules:\n - apiGroups: [ \"*\" ]\n resources: [ \"*\" ]\n verbs: [ get, list ]\n
ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: hpe-viewer\nsubjects:\n- kind: User\n name: hpe\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: hpe-viewer\n apiGroup: rbac.authorization.k8s.io\n
Both of these resources can be created using the kubectl apply
command.
"},{"location":"guides/rbac-for-users/readme/#testing","title":"Testing","text":"Get, List, Create, Delete, and Modify operations can be tested as the \"hpe\" user by setting the KUBECONFIG environment variable to use the new kubeconfig file. Get and List should be the only allowed operations. Other operations should fail with a \"forbidden\" error.
export KUBECONFIG=/tmp/hpe/hpe.conf\n
"},{"location":"guides/rbac-for-users/readme/#rbac-for-workload-manager-wlm","title":"RBAC for Workload Manager (WLM)","text":"Note This section assumes the reader has read and understood the steps described above for setting up RBAC for Users
.
A workload manager (WLM) such as Flux or Slurm will interact with DataWorkflowServices as a privileged user. RBAC is used to limit the operations that a WLM can perform on a Rabbit system.
The following steps are required to create a user and a role for the WLM. In this case, we're creating a user to be used with the Flux WLM:
- Generate a new key/cert pair for a \"flux\" user
- Creating a new kubeconfig file
- Adding RBAC rules for the \"flux\" user to allow appropriate access to the DataWorkflowServices API.
"},{"location":"guides/rbac-for-users/readme/#generate-a-key-and-certificate_1","title":"Generate a Key and Certificate","text":"Generate a key and certificate for our \"flux\" user, similar to the way we created one for the \"hpe\" user above. Substitute \"flux\" in place of \"hpe\".
"},{"location":"guides/rbac-for-users/readme/#create-a-kubeconfig_1","title":"Create a kubeconfig","text":"After the keys have been generated, a new kubeconfig file can be created for the \"flux\" user, similar to the one for the \"hpe\" user above. Again, substitute \"flux\" in place of \"hpe\".
"},{"location":"guides/rbac-for-users/readme/#use-the-provided-clusterrole-and-create-a-clusterrolebinding","title":"Use the provided ClusterRole and create a ClusterRoleBinding","text":"DataWorkflowServices has already defined the role to be used with WLMs, named dws-workload-manager
:
kubectl get clusterrole dws-workload-manager\n
If the \"flux\" user requires only the normal WLM permissions, then create and apply a ClusterRoleBinding to associate the \"flux\" user with the dws-workload-manager
ClusterRole.
The `dws-workload-manager role is defined in workload_manager_role.yaml.
ClusterRoleBinding for WLM permissions only:
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: flux\nsubjects:\n- kind: User\n name: flux\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: dws-workload-manager\n apiGroup: rbac.authorization.k8s.io\n
If the \"flux\" user requires the normal WLM permissions as well as some of the NNF permissions, perhaps to collect some NNF resources for debugging, then create and apply a ClusterRoleBinding to associate the \"flux\" user with the nnf-workload-manager
ClusterRole.
The nnf-workload-manager
role is defined in workload_manager_nnf_role.yaml.
ClusterRoleBinding for WLM and NNF permissions:
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: flux\nsubjects:\n- kind: User\n name: flux\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: nnf-workload-manager\n apiGroup: rbac.authorization.k8s.io\n
The WLM should then use the kubeconfig file associated with this \"flux\" user to access the DataWorkflowServices API and the Rabbit system.
"},{"location":"guides/storage-profiles/readme/","title":"Storage Profile Overview","text":"Storage Profiles allow for customization of the Rabbit storage provisioning process. Examples of content that can be customized via storage profiles is
- The RAID type used for storage
- Any mkfs or LVM args used
- An external MGS NID for Lustre
- A boolean value indicating the Lustre MGT and MDT should be combined on the same target device
DW directives that allocate storage on Rabbit nodes allow a profile
parameter to be specified to control how the storage is configured. NNF software provides a set of canned profiles to choose from, and the administrator may create more profiles.
The administrator shall choose one profile to be the default profile that is used when a profile parameter is not specified.
"},{"location":"guides/storage-profiles/readme/#specifying-a-profile","title":"Specifying a Profile","text":"To specify a profile name on a #DW directive, use the profile
option
#DW jobdw type=lustre profile=durable capacity=5GB name=example\n
"},{"location":"guides/storage-profiles/readme/#setting-a-default-profile","title":"Setting A Default Profile","text":"A default profile must be defined at all times. Any #DW line that does not specify a profile will use the default profile. If a default profile is not defined, then any new workflows will be rejected. If more than one profile is marked as default then any new workflows will be rejected.
To query existing profiles
$ kubectl get nnfstorageprofiles -A\nNAMESPACE NAME DEFAULT AGE\nnnf-system durable true 14s\nnnf-system performance false 6s\n
To set the default flag on a profile
$ kubectl patch nnfstorageprofile performance -n nnf-system --type merge -p '{\"data\":{\"default\":true}}'\n
To clear the default flag on a profile
$ kubectl patch nnfstorageprofile durable -n nnf-system --type merge -p '{\"data\":{\"default\":false}}'\n
"},{"location":"guides/storage-profiles/readme/#creating-the-initial-default-profile","title":"Creating The Initial Default Profile","text":"Create the initial default profile from scratch or by using the NnfStorageProfile/template resource as a template. If nnf-deploy
was used to install nnf-sos then the default profile described below will have been created automatically.
To use the template
resource begin by obtaining a copy of it either from the nnf-sos repo or from a live system. To get it from a live system use the following command:
kubectl get nnfstorageprofile -n nnf-system template -o yaml > profile.yaml\n
Edit the profile.yaml
file to trim the metadata section to contain only a name and namespace. The namespace must be left as nnf-system, but the name should be set to signify that this is the new default profile. In this example we will name it default
. The metadata section will look like the following, and will contain no other fields:
metadata:\n name: default\n namespace: nnf-system\n
Mark this new profile as the default profile by setting default: true
in the data section of the resource:
data:\n default: true\n
Apply this resource to the system and verify that it is the only one marked as the default resource:
kubectl get nnfstorageprofile -A\n
The output will appear similar to the following:
NAMESPACE NAME DEFAULT AGE\nnnf-system default true 9s\nnnf-system template false 11s\n
The administrator should edit the default
profile to record any cluster-specific settings. Maintain a copy of this resource YAML in a safe place so it isn't lost across upgrades.
"},{"location":"guides/storage-profiles/readme/#keeping-the-default-profile-updated","title":"Keeping The Default Profile Updated","text":"An upgrade of nnf-sos may include updates to the template
profile. It may be necessary to manually copy these updates into the default
profile.
"},{"location":"guides/storage-profiles/readme/#profile-parameters","title":"Profile Parameters","text":""},{"location":"guides/storage-profiles/readme/#xfs","title":"XFS","text":"The following shows how to specify command line options for pvcreate, vgcreate, lvcreate, and mkfs for XFS storage. Optional mount options are specified one per line
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: xfs-stripe-example\n namespace: nnf-system\ndata:\n[...]\n xfsStorage:\n commandlines:\n pvCreate: $DEVICE\n vgCreate: $VG_NAME $DEVICE_LIST\n lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME\n mkfs: $DEVICE\n options:\n mountRabbit:\n - noatime\n - nodiratime\n[...]\n
"},{"location":"guides/storage-profiles/readme/#gfs2","title":"GFS2","text":"The following shows how to specify command line options for pvcreate, lvcreate, and mkfs for GFS2.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: gfs2-stripe-example\n namespace: nnf-system\ndata:\n[...]\n gfs2Storage:\n commandlines:\n pvCreate: $DEVICE\n vgCreate: $VG_NAME $DEVICE_LIST\n lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME\n mkfs: -j2 -p $PROTOCOL -t $CLUSTER_NAME:$LOCK_SPACE $DEVICE\n[...]\n
"},{"location":"guides/storage-profiles/readme/#lustre-zfs","title":"Lustre / ZFS","text":"The following shows how to specify a zpool virtual device (vdev). In this case the default vdev is a stripe. See zpoolconcepts(7) for virtual device descriptions.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: zpool-stripe-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n mgtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mgs $VOL_NAME\n mdtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mdt --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n mgtMdtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mgs --mdt --fsname=$FS_NAME --index=$INDEX $VOL_NAME\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#zfs-dataset-properties","title":"ZFS dataset properties","text":"The following shows how to specify ZFS dataset properties in the --mkfsoptions
arg for mkfs.lustre. See zfsprops(7).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: zpool-stripe-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#mount-options-for-targets","title":"Mount Options for Targets","text":""},{"location":"guides/storage-profiles/readme/#persistent-mount-options","title":"Persistent Mount Options","text":"Use the mkfs.lustre --mountfsoptions
parameter to set persistent mount options for Lustre targets.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: target-mount-option-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mountfsoptions=\"errors=remount-ro,mballoc\" --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#non-persistent-mount-options","title":"Non-Persistent Mount Options","text":"Non-persistent mount options can be specified with the ostOptions.mountTarget parameter to the NnfStorageProfile:
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: target-mount-option-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mountfsoptions=\"errors=remount-ro\" --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n ostOptions:\n mountTarget:\n - mballoc\n[...]\n
"},{"location":"guides/storage-profiles/readme/#target-layout","title":"Target Layout","text":"Users may want Lustre file systems with different performance characteristics. For example, a user job with a single compute node accessing the Lustre file system would see acceptable performance from a single OSS. An FPP workload might want as many OSSs as posible to avoid contention.
The NnfStorageProfile
allows admins to specify where and how many Lustre targets are allocated by the WLM. During the proposal phase of the workflow, the NNF software uses the information in the NnfStorageProfile
to add extra constraints in the DirectiveBreakdown
. The WLM uses these constraints when picking storage.
The NnfStorageProfile
has three fields in the mgtOptions
, mdtOptions
, and ostOptions
to specify target layout. The fields are:
count
- A static value for how many Lustre targets to create. scale
- A value from 1-10 that the WLM can use to determine how many Lustre targets to allocate. This is up to the WLM and the admins to agree on how to interpret this field. A value of 1 might indicate the minimum number of NNF nodes needed to reach the minimum capacity, while 10 might result in a Lustre target on every Rabbit attached to the computes in the job. Scale takes into account allocation size, compute node count, and Rabbit count. colocateComputes
- true/false value. When \"true\", this adds a location constraint in the DirectiveBreakdown
that limits the WLM to picking storage with a physical connection to the compute resources. In practice this means that Rabbit storage is restricted to the chassis used by the job. This can be set individually for each of the Lustre target types. When this is \"false\", any Rabbit storage can be picked, even if the Rabbit doesn't share a chassis with any of the compute nodes in the job.
Only one of scale
and count
can be set for a particular target type.
The DirectiveBreakdown
for create_persistent
#DWs won't include the constraint from colocateCompute=true
since there may not be any compute nodes associated with the job.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: high-metadata\n namespace: default\ndata:\n default: false\n...\n lustreStorage:\n combinedMgtMdt: false\n capacityMdt: 500GiB\n capacityMgt: 1GiB\n[...]\n ostOptions:\n scale: 5\n colocateComputes: true\n mdtOptions:\n count: 10\n
"},{"location":"guides/storage-profiles/readme/#example-layouts","title":"Example Layouts","text":"scale
with colocateComputes=true
will likely be the most common layout type to use for jobdw
directives. This will result in a Lustre file system whose performance scales with the number of compute nodes in the job.
count
may be used when a specific performance characteristic is desired such as a single shared file workload that has low metadata requirements and only needs a single MDT. It may also be useful when a consistently performing file system is required across different jobs.
colocatedComputes=false
may be useful for placing MDTs on NNF nodes without an OST (within the same file system).
The count
field may be useful when creating a persistent file system since the job with the create_persistent
directive may only have a single compute node.
In general, scale
gives a simple way for users to get a filesystem that has performance consistent with their job size. count
is useful for times when a user wants full control of the file system layout.
"},{"location":"guides/storage-profiles/readme/#command-line-variables","title":"Command Line Variables","text":""},{"location":"guides/storage-profiles/readme/#pvcreate","title":"pvcreate","text":" $DEVICE
- expands to the /dev/<path>
value for one device that has been allocated
"},{"location":"guides/storage-profiles/readme/#vgcreate","title":"vgcreate","text":" $VG_NAME
- expands to a volume group name that is controlled by Rabbit software. $DEVICE_LIST
- expands to a list of space-separated /dev/<path>
devices. This list will contain the devices that were iterated over for the pvcreate step.
"},{"location":"guides/storage-profiles/readme/#lvcreate","title":"lvcreate","text":" $VG_NAME
- see vgcreate above. $LV_NAME
- expands to a logical volume name that is controlled by Rabbit software. $DEVICE_NUM
- expands to a number indicating the number of devices allocated for the volume group. $DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the $DEVICE_LIST
above.
"},{"location":"guides/storage-profiles/readme/#xfs-mkfs","title":"XFS mkfs","text":" $DEVICE
- expands to the /dev/<path>
value for the logical volume that was created by the lvcreate step above.
"},{"location":"guides/storage-profiles/readme/#gfs2-mkfs","title":"GFS2 mkfs","text":" $DEVICE
- expands to the /dev/<path>
value for the logical volume that was created by the lvcreate step above. $CLUSTER_NAME
- expands to a cluster name that is controlled by Rabbit Software $LOCK_SPACE
- expands to a lock space key that is controlled by Rabbit Software. $PROTOCOL
- expands to a locking protocol that is controlled by Rabbit Software.
"},{"location":"guides/storage-profiles/readme/#zpool-create","title":"zpool create","text":" $DEVICE_LIST
- expands to a list of space-separated /dev/<path>
devices. This list will contain the devices that were allocated for this storage request. $POOL_NAME
- expands to a pool name that is controlled by Rabbit software. $DEVICE_NUM
- expands to a number indicating the number of devices allocated for this storage request. $DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the $DEVICE_LIST
above.
"},{"location":"guides/storage-profiles/readme/#lustre-mkfs","title":"lustre mkfs","text":" $FS_NAME
- expands to the filesystem name that was passed to Rabbit software from the workflow's #DW line. $MGS_NID
- expands to the NID of the MGS. If the MGS was orchestrated by nnf-sos then an appropriate internal value will be used. $POOL_NAME
- see zpool create above. $VOL_NAME
- expands to the volume name that will be created. This value will be <pool_name>/<dataset>
, and is controlled by Rabbit software. $INDEX
- expands to the index value of the target and is controlled by Rabbit software.
"},{"location":"guides/user-containers/readme/","title":"NNF User Containers","text":"NNF User Containers are a mechanism to allow user-defined containerized applications to be run on Rabbit nodes with access to NNF ephemeral and persistent storage.
"},{"location":"guides/user-containers/readme/#overview","title":"Overview","text":"Container workflows are orchestrated through the use of two components: Container Profiles and Container Directives. A Container Profile defines the container to be executed. Most importantly, it allows you to specify which NNF storages are accessible within the container and which container image to run. The containers are executed on the NNF nodes that are allocated to your container workflow. These containers can be executed in either of two modes: Non-MPI and MPI.
For Non-MPI applications, the image and command are launched across all the targeted NNF Nodes in a uniform manner. This is useful in simple applications, where non-distributed behavior is desired.
For MPI applications, a single launcher container serves as the point of contact, responsible for distributing tasks to various worker containers. Each of the NNF nodes targeted by the workflow receives its corresponding worker container. The focus of this documentation will be on MPI applications.
To see a full working example before diving into these docs, see Putting It All Together.
"},{"location":"guides/user-containers/readme/#before-creating-a-container-workflow","title":"Before Creating a Container Workflow","text":"Before creating a workflow, a working NnfContainerProfile
must exist. This profile is referenced in the container directive supplied with the workflow.
"},{"location":"guides/user-containers/readme/#container-profiles","title":"Container Profiles","text":"The author of a containerized application will work with the administrator to define a pod specification template for the container and to create an appropriate NnfContainerProfile
resource for the container. The image and tag for the user's container will be specified in the profile.
The image must be available in a registry that is available to your system. This could be docker.io, ghcr.io, etc., or a private registry. Note that for a private registry, some additional setup is required. See here for more info.
The image itself has a few requirements. See here for more info on building images.
New NnfContainerProfile
resources may be created by copying one of the provided example profiles from the nnf-system
namespace . The examples may be found by listing them with kubectl
:
kubectl get nnfcontainerprofiles -n nnf-system\n
The next few subsections provide an overview of the primary components comprising an NnfContainerProfile
. However, it's important to note that while these sections cover the key aspects, they don't encompass every single detail. For an in-depth understanding of the capabilities offered by container profiles, we recommend referring to the following resources:
- Type definition for
NnfContainerProfile
- Sample for
NnfContainerProfile
- Online Examples for
NnfContainerProfile
(same as kubectl get
above)
"},{"location":"guides/user-containers/readme/#container-storages","title":"Container Storages","text":"The Storages
defined in the profile allow NNF filesystems to be made available inside of the container. These storages need to be referenced in the container workflow unless they are marked as optional.
There are three types of storages available to containers:
- local non-persistent storage (created via
#DW jobdw
directives) - persistent storage (created via
#DW create_persistent
directives) - global lustre storage (defined by
LustreFilesystems
)
For local and persistent storage, only GFS2 and Lustre filesystems are supported. Raw and XFS filesystems cannot be mounted more than once, so they cannot be mounted inside of a container while also being mounted on the NNF node itself.
For each storage in the profile, the name must follow these patterns (depending on the storage type):
DW_JOB_<storage_name>
DW_PERSISTENT_<storage_name>
DW_GLOBAL_<storage_name>
<storage_name>
is provided by the user and needs to be a name compatible with Linux environment variables (so underscores must be used, not dashes), since the storage mount directories are provided to the container via environment variables.
This storage name is used in container workflow directives to reference the NNF storage name that defines the filesystem. Find more info on that in Creating a Container Workflow.
Storages may be deemed as optional
in a profile. If a storage is not optional, the storage name must be set to the name of an NNF filesystem name in the container workflow.
For global lustre, there is an additional field for pvcMode
, which must match the mode that is configured in the LustreFilesystem
resource that represents the global lustre filesystem. This defaults to ReadWriteMany
.
Example:
storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n - name: DW_GLOBAL_foo_global_lustre\n optional: true\n pvcMode: ReadWriteMany\n
"},{"location":"guides/user-containers/readme/#container-spec","title":"Container Spec","text":"As mentioned earlier, container workflows can be categorized into two types: MPI and Non-MPI. It's essential to choose and define only one of these types within the container profile. Regardless of the type chosen, the data structure that implements the specification is equipped with two \"standard\" resources that are distinct from NNF custom resources.
For Non-MPI containers, the specification utilizes the spec
resource. This is the standard Kubernetes PodSpec
that outlines the desired configuration for the pod.
For MPI containers, mpiSpec
is used. This custom resource, available through MPIJobSpec
from mpi-operator
, serves as a facilitator for executing MPI applications across worker containers. This resource can be likened to a wrapper around a PodSpec
, but users need to define a PodSpec
for both Launcher and Worker containers.
See the MPIJobSpec
definition for more details on what can be configured for an MPI application.
It's important to bear in mind that the NNF Software is designed to override specific values within the MPIJobSpec
for ensuring the desired behavior in line with NNF software requirements. To prevent complications, it's advisable not to delve too deeply into the specification. A few illustrative examples of fields that are overridden by the NNF Software include:
- Replicas
- RunPolicy.BackoffLimit
- Worker/Launcher.RestartPolicy
- SSHAuthMountPath
By keeping these considerations in mind and refraining from extensive alterations to the specification, you can ensure a smoother integration with the NNF Software and mitigate any potential issues that may arise.
Please see the Sample and Examples listed above for more detail on container Specs.
"},{"location":"guides/user-containers/readme/#container-ports","title":"Container Ports","text":"Container Profiles allow for ports to be reserved for a container workflow. numPorts
can be used to specify the number of ports needed for a container workflow. The ports are opened on each targeted NNF node and are accessible outside of the cluster. Users must know how to contact the specific NNF node. It is recommend that DNS entries are made for this purpose.
In the workflow, the allocated port numbers are made available via the NNF_CONTAINER_PORTS
environment variable.
The workflow requests this number of ports from the NnfPortManager
, which is responsible for managing the ports allocated to container workflows. This resource can be inspected to see which ports are allocated.
Once a port is assigned to a workflow, that port number becomes unavailable for use by any other workflow until it is released.
Note
The SystemConfiguration
must be configured to allow for a range of ports, otherwise container workflows will fail in the Setup
state due to insufficient resources. See SystemConfiguration Setup.
"},{"location":"guides/user-containers/readme/#systemconfiguration-setup","title":"SystemConfiguration Setup","text":"In order for container workflows to request ports from the NnfPortManager
, the SystemConfiguration
must be configured for a range of ports:
kind: SystemConfiguration\nmetadata:\n name: default\n namespace: default\nspec:\n # Ports is the list of ports available for communication between nodes in the\n # system. Valid values are single integers, or a range of values of the form\n # \"START-END\" where START is an integer value that represents the start of a\n # port range and END is an integer value that represents the end of the port\n # range (inclusive).\n ports:\n - 4000-4999\n # PortsCooldownInSeconds is the number of seconds to wait before a port can be\n # reused. Defaults to 60 seconds (to match the typical value for the kernel's\n # TIME_WAIT). A value of 0 means the ports can be reused immediately.\n # Defaults to 60s if not set.\n portsCooldownInSeconds: 60\n
ports
is empty by default, and must be set by an administrator.
Multiple port ranges can be specified in this list, as well as single integers. This must be a safe port range that does not interfere with the ephemeral port range of the Linux kernel. The range should also account for the estimated number of simultaneous users that are running container workflows.
Once a container workflow is done, the port is released and the NnfPortManager
will not allow reuse of the port until the amount of time specified by portsCooldownInSeconds
has elapsed. Then the port can be reused by another container workflow.
"},{"location":"guides/user-containers/readme/#restricting-to-user-id-or-group-id","title":"Restricting To User ID or Group ID","text":"New NnfContainerProfile resources may be restricted to a specific user ID or group ID . When a data.userID
or data.groupID
is specified in the profile, only those Workflow resources having a matching user ID or group ID will be allowed to use that profile . If the profile specifies both of these IDs, then the Workflow resource must match both of them.
"},{"location":"guides/user-containers/readme/#creating-a-container-workflow","title":"Creating a Container Workflow","text":"The user's workflow will specify the name of the NnfContainerProfile
in a DW directive. If the custom profile is named red-rock-slushy
then it will be specified in the #DW container
directive with the profile
parameter.
#DW container profile=red-rock-slushy [...]\n
Furthermore, to set the container storages for the workflow, storage parameters must also be supplied in the workflow. This is done using the <storage_name>
(see Container Storages) and setting it to the name of a storage directive that defines an NNF filesystem. That storage directive must already exist as part of another workflow (e.g. persistent storage) or it can be supplied in the same workflow as the container. For global lustre, the LustreFilesystem
must exist that represents the global lustre filesystem.
In this example, we're creating a GFS2 filesystem to accompany the container directive. We're using the red-rock-slushy
profile which contains a non-optional storage called DW_JOB_local_storage
:
kind: NnfContainerProfile\nmetadata:\n name: red-rock-slushy\ndata:\n storages:\n - name: DW_JOB_local_storage\n optional: false\n template:\n mpiSpec:\n ...\n
The resulting container directive looks like this:
#DW jobdw name=my-gfs2 type=gfs2 capacity=100GB\"\n#DW container name=my-container profile=red-rock-slushy DW_JOB_local_storage=my-gfs2\n
Once the workflow progresses, this will create a 100GB GFS2 filesystem that is then mounted into the container upon creation. An environment variable called DW_JOB_local_storage
is made available inside of the container and provides the path to the mounted NNF GFS2 filesystem. An application running inside of the container can then use this variable to get to the filesystem mount directory. See here.
Multiple storages can be defined in the container directives. Only one container directive is allowed per workflow.
Note
GFS2 filesystems have special considerations since the mount directory contains directories for every compute node. See GFS2 Index Mounts for more info.
"},{"location":"guides/user-containers/readme/#targeting-nodes","title":"Targeting Nodes","text":"For container directives, compute nodes must be assigned to the workflow. The NNF software will trace the compute nodes back to their local NNF nodes and the containers will be executed on those NNF nodes. The act of assigning compute nodes to your container workflow instructs the NNF software to select the NNF nodes that run the containers.
For the jobdw
directive that is included above, the servers (i.e. NNF nodes) must also be assigned along with the computes.
"},{"location":"guides/user-containers/readme/#running-a-container-workflow","title":"Running a Container Workflow","text":"Once the workflow is created, the WLM progresses it through the following states. This is a quick overview of the container-related behavior that occurs:
- Proposal: Verify storages are provided according to the container profile.
- Setup: If applicable, request ports from NnfPortManager.
- DataIn: No container related activity.
- PreRun: Appropriate
MPIJob
or Job(s)
are created for the workflow. In turn, user containers are created and launched by Kubernetes. Containers are expected to start in this state. - PostRun: Once in PostRun, user containers are expected to complete (non-zero exit) successfully.
- DataOut: No container related activity.
- Teardown: Ports are released;
MPIJob
or Job(s)
are deleted, which in turn deletes the user containers.
The two main states of a container workflow (i.e. PreRun, PostRun) are discussed further in the following sections.
"},{"location":"guides/user-containers/readme/#prerun","title":"PreRun","text":"In PreRun, the containers are created and expected to start. Once the containers reach a non-initialization state (i.e. Running), the containers are considered to be started and the workflow can advance.
By default, containers are expected to start within 60 seconds. If not, the workflow reports an Error that the containers cannot be started. This value is configurable via the preRunTimeoutSeconds
field in the container profile.
To summarize the PreRun behavior:
- If the container starts successfully (running), transition to
Completed
status. - If the container fails to start, transition to the
Error
status. - If the container is initializing and has not started after
preRunTimeoutSeconds
seconds, terminate the container and transition to the Error
status.
"},{"location":"guides/user-containers/readme/#init-containers","title":"Init Containers","text":"The NNF Software injects Init Containers into the container specification to perform initialization tasks. These containers must run to completion before the main container can start.
These initialization tasks include:
- Ensuring the proper permissions (i.e. UID/GID) are available in the main container
- For MPI jobs, ensuring the launcher pod can contact each worker pod via DNS
"},{"location":"guides/user-containers/readme/#prerun-completed","title":"PreRun Completed","text":"Once PreRun has transitioned to Completed
status, the user container is now running and the WLM should initiate applications on the compute nodes. Utilizing container ports, the applications on the compute nodes can establish communication with the user containers, which are running on the local NNF node attached to the computes.
This communication allows for the compute node applications to drive certain behavior inside of the user container. For example, once the compute node application is complete, it can signal to the user container that it is time to perform cleanup or data migration action.
"},{"location":"guides/user-containers/readme/#postrun","title":"PostRun","text":"In PostRun, the containers are expected to exit cleanly with a zero exit code. If a container fails to exit cleanly, the Kubernetes software attempts a number of retries based on the configuration of the container profile. It continues to do this until the container exits successfully, or until the retryLimit
is hit - whichever occurs first. In the latter case, the workflow reports an Error.
Read up on the Failure Retries for more information on retries.
Furthermore, the container profile features a postRunTimeoutSeconds
field. If this timeout is reached before the container successfully exits, it triggers an Error
status. The timer for this timeout begins upon entry into the PostRun phase, allowing the containers the specified period to execute before the workflow enters an Error
status.
To recap the PostRun behavior:
- If the container exits successfully, transition to
Completed
status. - If the container exits unsuccessfully after
retryLimit
number of retries, transition to the Error
status. - If the container is running and has not exited after
postRunTimeoutSeconds
seconds, terminate the container and transition to the Error
status.
"},{"location":"guides/user-containers/readme/#failure-retries","title":"Failure Retries","text":"If a container fails (non-zero exit code), the Kubernetes software implements retries. The number of retries can be set via the retryLimit
field in the container profile. If a non-zero exit code is detected, the Kubernetes software creates a new instance of the pod and retries. The default number of retries for retryLimit
is set to 6, which is the default value for Kubernetes Jobs. This means that if the pod fails every single time, there will be 7 failed pods in total, since Kubernetes attempts 6 retries after the first failure.
To understand this behavior more, see Pod backoff failure policy in the Kubernetes documentation. This explains the retry (i.e. backoff) behavior in more detail.
It is important to note that due to the configuration of the MPIJob
and/or Job
that is created for User Containers, the container retries are immediate - there is no backoff timeout between retries. This is due to the NNF Software setting the RestartPolicy
to Never
, which causes a new pod to spin up after every failure rather than re-use (i.e. restart) the previously failed pod. This allows a user to see a complete history of the failed pod(s) and the logs can easily be obtained. See more on this at Handling Pod and container failures in the Kubernetes documentation.
"},{"location":"guides/user-containers/readme/#putting-it-all-together","title":"Putting it All Together","text":"See the NNF Container Example for a working example of how to run a simple MPI application inside of an NNF User Container and run it through a Container Workflow.
"},{"location":"guides/user-containers/readme/#reference","title":"Reference","text":""},{"location":"guides/user-containers/readme/#environment-variables","title":"Environment Variables","text":"Two sets of environment variables are available with container workflows: Container and Compute Node. The former are the variables that are available inside the user containers. The latter are the variables that are provided back to the DWS workflow, which in turn are collected by the WLM and provided to compute nodes. See the WLM documentation for more details.
"},{"location":"guides/user-containers/readme/#container-environment-variables","title":"Container Environment Variables","text":"These variables are provided for use inside the container. They can be used as part of the container command in the NNF Container Profile or within the container itself.
"},{"location":"guides/user-containers/readme/#storages","title":"Storages","text":"Each storage defined by a container profile and used in a container workflow results in a corresponding environment variable. This variable is used to hold the mount directory of the filesystem.
"},{"location":"guides/user-containers/readme/#gfs2-index-mounts","title":"GFS2 Index Mounts","text":"When using a GFS2 file system, each compute is allocated its own NNF volume. The NNF software mounts a collection of directories that are indexed (e.g. 0/
, 1/
, etc) to the compute nodes.
Application authors must be aware that their desired GFS2 mount point is really a collection of directories, one for each compute node. It is the responsibility of the author to understand the underlying filesystem mounted at the storage environment variable (e.g. $DW_JOB_my_gfs2_storage
).
Each compute node's application can leave breadcrumbs (e.g. hostnames) somewhere on the GFS2 filesystem mounted on the compute node. This can be used to identify the index mount directory to a compute node from the application running inside of the user container.
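A minimal sketch of this breadcrumb approach, assuming a storage named my-gfs2-storage (names are illustrative): each compute node writes its hostname into its own mount, and the user container then maps each index directory back to a compute node.
# On each compute node\nhostname > $DW_JOB_my_gfs2_storage/hostname\n\n# Inside the user container on the NNF node\nfor d in $DW_JOB_my_gfs2_storage/*; do echo \"$d -> $(cat $d/hostname)\"; done\n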
Here is an example of 3 compute nodes on an NNF node targeted in a GFS2 workflow:
$ ls $DW_JOB_my_gfs2_storage/*\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/0\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/1\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/2\n
Node positions are not absolute locations. The WLM could, in theory, select 6 physical compute nodes at physical locations 1, 2, 3, 5, 8, and 13, which would appear as directories /0
through /5
in the container mount path.
Additionally, container instances may not all see the same number of compute nodes in an indexed-mount scenario. If 17 compute nodes are required for the job, the WLM may assign 16 compute nodes to one NNF node and 1 compute node to another. The first NNF node would have 16 index directories, whereas the second would contain only 1.
"},{"location":"guides/user-containers/readme/#hostnames-and-domains","title":"Hostnames and Domains","text":"Containers can contact one another via Kubernetes cluster networking. This functionality is provided by DNS. Environment variables are provided that allow a user to be able to piece together the FQDN so that the other containers can be contacted.
This example demonstrates an MPI container workflow, with two worker pods. Two worker pods means two pods/containers running on two NNF nodes.
"},{"location":"guides/user-containers/readme/#ports","title":"Ports","text":"See the NNF_CONTAINER_PORTS
section under Compute Node Environment Variables.
mpiuser@my-container-workflow-launcher:~$ env | grep NNF\nNNF_CONTAINER_HOSTNAMES=my-container-workflow-launcher my-container-workflow-worker-0 my-container-workflow-worker-1\nNNF_CONTAINER_DOMAIN=default.svc.cluster.local\nNNF_CONTAINER_SUBDOMAIN=my-container-workflow-worker\n
The container FQDN consists of the following: <HOSTNAME>.<SUBDOMAIN>.<DOMAIN>
. To contact the other worker container from worker 0, my-container-workflow-worker-1.my-container-workflow-worker.default.svc.cluster.local
would be used.
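A small sketch of assembling the worker FQDNs from the environment variables shown above:
for host in $NNF_CONTAINER_HOSTNAMES; do\n  echo $host.$NNF_CONTAINER_SUBDOMAIN.$NNF_CONTAINER_DOMAIN\ndone\n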
For MPI-based containers, an alternate way to retrieve this information is to look at the default hostfile
, provided by mpi-operator
. This file lists out all the worker nodes' FQDNs:
mpiuser@my-container-workflow-launcher:~$ cat /etc/mpi/hostfile\nmy-container-workflow-worker-0.my-container-workflow-worker.default.svc slots=1\nmy-container-workflow-worker-1.my-container-workflow-worker.default.svc slots=1\n
"},{"location":"guides/user-containers/readme/#compute-node-environment-variables","title":"Compute Node Environment Variables","text":"These environment variables are provided to the compute node via the WLM by way of the DWS Workflow. Note that these environment variables are consistent across all the compute nodes for a given workflow.
Note
It's important to note that the variables presented here pertain exclusively to User Container-related variables. This list does not encompass the entirety of NNF environment variables accessible to the compute node through the Workload Manager (WLM)
"},{"location":"guides/user-containers/readme/#nnf_container_ports","title":"NNF_CONTAINER_PORTS
","text":"If the NNF Container Profile requests container ports, then this environment variable provides the allocated ports for the container. This is a comma separated list of ports if multiple ports are requested.
This allows an application on the compute node to contact the user container running on its local NNF node via these port numbers. The compute node must have proper routing to the NNF Node and needs a generic way of contacting the NNF node. It is suggested than a DNS entry is provided via /etc/hosts
, or similar.
For cases where one port is requested, the following can be used to contact the user container running on the NNF node (assuming a DNS entry for local-rabbit
is provided via /etc/hosts
).
local-rabbit:$(NNF_CONTAINER_PORTS)\n
"},{"location":"guides/user-containers/readme/#creating-images","title":"Creating Images","text":"For details, refer to the NNF Container Example Readme. However, in broad terms, an image that is capable of supporting MPI necessitates the following components:
- User Application: Your specific application
- Open MPI: Incorporate Open MPI to facilitate MPI operations
- SSH Server: Including an SSH server to enable communication
- nslookup: To validate Launcher/Worker container communication over the network
By ensuring the presence of these components, users can create an image that supports MPI operations on the NNF platform.
The nnf-mfu image serves as a suitable base image, encompassing all the essential components required for this purpose.
"},{"location":"guides/user-containers/readme/#using-a-private-container-repository","title":"Using a Private Container Repository","text":"The user's containerized application may be placed in a private repository . In this case, the user must define an access token to be used with that repository, and that token must be made available to the Rabbit's Kubernetes environment so that it can pull that container from the private repository.
See Pull an Image from a Private Registry in the Kubernetes documentation for more information.
"},{"location":"guides/user-containers/readme/#about-the-example","title":"About the Example","text":"Each container registry will have its own way of letting its users create tokens to be used with their repositories . Docker Hub will be used for the private repository in this example, and the user's account on Docker Hub will be \"dean\".
"},{"location":"guides/user-containers/readme/#preparing-the-private-repository","title":"Preparing the Private Repository","text":"The user's application container is named \"red-rock-slushy\" . To store this container on Docker Hub the user must log into docker.com with their browser and click the \"Create repository\" button to create a repository named \"red-rock-slushy\", and the user must check the box that marks the repository as private . The repository's name will be displayed as \"dean/red-rock-slushy\" with a lock icon to show that it is private.
"},{"location":"guides/user-containers/readme/#create-and-push-a-container","title":"Create and Push a Container","text":"The user will create their container image in the usual ways, naming it for their private repository and tagging it according to its release.
Prior to pushing images to the repository, the user must complete a one-time login to the Docker registry using the docker command-line tool.
docker login -u dean\n
After completing the login, the user may then push their images to the repository.
docker push dean/red-rock-slushy:v1.0\n
"},{"location":"guides/user-containers/readme/#generate-a-read-only-token","title":"Generate a Read-Only Token","text":"A read-only token must be generated to allow Kubernetes to pull that container image from the private repository, because Kubernetes will not be running as that user . This token must be given to the administrator, who will use it to create a Kubernetes secret.
To log in and generate a read-only token to share with the administrator, the user must follow these steps:
- Visit docker.com and log in using their browser.
- Click on the username in the upper right corner.
- Select \"Account Settings\" and navigate to \"Security\".
- Click the \"New Access Token\" button to create a read-only token.
- Keep a copy of the generated token to share with the administrator.
"},{"location":"guides/user-containers/readme/#store-the-read-only-token-as-a-kubernetes-secret","title":"Store the Read-Only Token as a Kubernetes Secret","text":"The administrator must store the user's read-only token as a kubernetes secret . The secret must be placed in the default
namespace, which is the same namespace where the user containers will be run . The secret must include the user's Docker Hub username and the email address they have associated with that username . In this case, the secret will be named readonly-red-rock-slushy
.
USER_TOKEN=users-token-text\nUSER_NAME=dean\nUSER_EMAIL=dean@myco.com\nSECRET_NAME=readonly-red-rock-slushy\nkubectl create secret docker-registry $SECRET_NAME -n default --docker-server=\"https://index.docker.io/v1/\" --docker-username=$USER_NAME --docker-password=$USER_TOKEN --docker-email=$USER_EMAIL\n
"},{"location":"guides/user-containers/readme/#add-the-secret-to-the-nnfcontainerprofile","title":"Add the Secret to the NnfContainerProfile","text":"The administrator must add an imagePullSecrets
list to the NnfContainerProfile resource that was created for this user's containerized application.
The following profile shows the placement of the readonly-red-rock-slushy
secret which was created in the previous step, and points to the user's dean/red-rock-slushy:v1.0
container.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfContainerProfile\nmetadata:\n name: red-rock-slushy\n namespace: nnf-system\ndata:\n pinned: false\n retryLimit: 6\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - command:\n - /users-application\n image: dean/red-rock-slushy:v1.0\n name: red-rock-app\n storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n
Now any user can select this profile in their Workflow by specifying it in a #DW container
directive.
#DW container profile=red-rock-slushy [...]\n
"},{"location":"guides/user-containers/readme/#using-a-private-container-repository-for-mpi-application-containers","title":"Using a Private Container Repository for MPI Application Containers","text":"If our user's containerized application instead contains an MPI application, because perhaps it's a private copy of nnf-mfu, then the administrator would insert two imagePullSecrets
lists into the mpiSpec
of the NnfContainerProfile for the MPI launcher and the MPI worker.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfContainerProfile\nmetadata:\n name: mpi-red-rock-slushy\n namespace: nnf-system\ndata:\n mpiSpec:\n mpiImplementation: OpenMPI\n mpiReplicaSpecs:\n Launcher:\n template:\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - command:\n - mpirun\n - dcmp\n - $(DW_JOB_foo_local_storage)/0\n - $(DW_JOB_foo_local_storage)/1\n image: dean/red-rock-slushy:v2.0\n name: red-rock-launcher\n Worker:\n template:\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - image: dean/red-rock-slushy:v2.0\n name: red-rock-worker\n runPolicy:\n cleanPodPolicy: Running\n suspend: false\n slotsPerWorker: 1\n sshAuthMountPath: /root/.ssh\n pinned: false\n retryLimit: 6\n storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n
Now any user can select this profile in their Workflow by specifying it in a #DW container
directive.
#DW container profile=mpi-red-rock-slushy [...]\n
"},{"location":"guides/user-interactions/readme/","title":"Rabbit User Interactions","text":""},{"location":"guides/user-interactions/readme/#overview","title":"Overview","text":"A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form #DW [command] [command args]
, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes.
Once the job is running on compute nodes, the application can find access to Rabbit specific resources through a set of environment variables that provide mount and network access information.
"},{"location":"guides/user-interactions/readme/#commands","title":"Commands","text":""},{"location":"guides/user-interactions/readme/#jobdw","title":"jobdw","text":"The jobdw
directive command tells the Rabbit software to create a file system on the Rabbit hardware for the lifetime of the user's job. At the end of the job, any data that is not moved off of the file system either by the application or through a copy_out
directive will be lost. Multiple jobdw
directives can be listed in the same job script.
"},{"location":"guides/user-interactions/readme/#command-arguments","title":"Command Arguments","text":"Argument Required Value Notes type
Yes raw
, xfs
, gfs2
, lustre
Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. capacity
Yes Allocation size with units. 1TiB
, 100GB
, etc. Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: KB
, KiB
, MB
, MiB
, GB
, GiB
, TB
, TiB
name
Yes String including numbers and '-' This is a name for the storage allocation that is unique within a job profile
No Profile name This specifies which profile to use when allocating storage. Profiles include mkfs
and mount
arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. More information about storage profiles can be found in the Storage Profiles guide. requires
No copy-offload
Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the Required Daemons section of the Directive Breakdown guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand."},{"location":"guides/user-interactions/readme/#examples","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=scratch\n
This directive results in a 10GiB xfs file system created for each compute node in the job using the default storage profile.
#DW jobdw type=lustre capacity=1TB name=dw-temp profile=high-metadata\n
This directive results in a single 1TB Lustre file system being created that can be accessed from all the compute nodes in the job. It is using a storage profile that an admin created to give high Lustre metadata performance.
#DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload\n
This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running using the Copy Offload API.
"},{"location":"guides/user-interactions/readme/#create_persistent","title":"create_persistent","text":"The create_persistent
command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single create_persistent
directive is allowed in a job, and it cannot be in the same job as a destroy_persistent
directive. See persistentdw to utilize the storage in a job.
"},{"location":"guides/user-interactions/readme/#command-arguments_1","title":"Command Arguments","text":"Argument Required Value Notes type
Yes raw
, xfs
, gfs2
, lustre
Type defines how the storage should be formatted. For Lustre file systems, a single file system is created. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node in the job. capacity
Yes Allocation size with units. 1TiB
, 100GB
, etc. Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: KB
, KiB
, MB
, MiB
, GB
, GiB
, TB
, TiB
name
Yes Lowercase string including numbers and '-' This is a name for the storage allocation that is unique within the system profile
No Profile name This specifies which profile to use when allocating storage. Profiles include mkfs
and mount
arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. The profile used when creating the persistent storage allocation is the same profile used by jobs that use the persistent storage. More information about storage profiles can be found in the Storage Profiles guide."},{"location":"guides/user-interactions/readme/#examples_1","title":"Examples","text":"#DW create_persistent type=xfs capacity=100GiB name=scratch\n
This directive results in a 100GiB xfs file system created for each compute node in the job using the default storage profile. Since xfs file systems are not network accessible, subsequent jobs that want to use the file system must have the same number of compute nodes, and be scheduled on compute nodes with access to the correct Rabbit nodes. This means the job with the create_persistent
directive must schedule the desired number of compute nodes even if no application is run on the compute nodes as part of the job.
#DW create_persistent type=lustre capacity=10TiB name=shared-data profile=read-only\n
This directive results in a single 10TiB Lustre file system being created that can be accessed later by any compute nodes in the system. Multiple jobs can access a Rabbit Lustre file system at the same time. This job can be scheduled with a single compute node (or zero compute nodes if the WLM allows), without any limitations on compute node counts for subsequent jobs using the persistent Lustre file system.
"},{"location":"guides/user-interactions/readme/#destroy_persistent","title":"destroy_persistent","text":"The destroy_persistent
command will delete persistent storage that was allocated by a corresponding create_persistent
. If the persistent storage is currently in use by a job, then the job containing the destroy_persistent
command will fail. Only a single destroy_persistent
directive is allowed in a job, and it cannot be in the same job as a create_persistent
directive.
"},{"location":"guides/user-interactions/readme/#command-arguments_2","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the persistent storage allocation that will be destroyed"},{"location":"guides/user-interactions/readme/#examples_2","title":"Examples","text":"#DW destroy_persistent name=shared-data\n
This directive will delete the persistent storage allocation with the name shared-data
"},{"location":"guides/user-interactions/readme/#persistentdw","title":"persistentdw","text":"The persistentdw
command makes an existing persistent storage allocation available to a job. The persistent storage must already be created from a create_persistent
command in a different job script. Multiple persistentdw
commands can be used in the same job script to request access to multiple persistent allocations.
Persistent Lustre file systems can be accessed from any compute nodes in the system, and the compute node count for the job can vary as needed. Multiple jobs can access a persistent Lustre file system concurrently if desired. Raw, xfs, and GFS2 file systems can only be accessed by compute nodes that have a physical connection to the Rabbits hosting the storage, and jobs accessing these storage types must have the same compute node count as the job that made the persistent storage.
"},{"location":"guides/user-interactions/readme/#command-arguments_3","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the persistent storage that will be accessed requires
No copy-offload
Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the Required Daemons section of the Directive Breakdown guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand."},{"location":"guides/user-interactions/readme/#examples_3","title":"Examples","text":"#DW persistentdw name=shared-data requires=copy-offload\n
This directive will cause the shared-data
persistent storage allocation to be mounted onto the compute nodes for the job application to use. The copy-offload daemon will be started on the compute nodes so the application can request data movement during the application run.
"},{"location":"guides/user-interactions/readme/#copy_incopy_out","title":"copy_in/copy_out","text":"The copy_in
and copy_out
directives are used to move data to and from the storage allocations on Rabbit nodes. The copy_in
directive requests that data be moved into the Rabbit file system before application launch, and the copy_out
directive requests data to be moved off of the Rabbit file system after application exit. This is different from data-movement that is requested through the copy-offload API, which occurs during application runtime. Multiple copy_in
and copy_out
directives can be included in the same job script. More information about data movement can be found in the Data Movement documentation.
"},{"location":"guides/user-interactions/readme/#command-arguments_4","title":"Command Arguments","text":"Argument Required Value Notes source
Yes [path]
, $DW_JOB_[name]/[path]
, $DW_PERSISTENT_[name]/[path]
[name]
is the name of the Rabbit persistent or job storage as specified in the name
argument of the jobdw
or persistentdw
directive. Any '-'
in the name from the jobdw
or persistentdw
directive should be changed to a '_'
in the copy_in
and copy_out
directive. destination
Yes [path]
, $DW_JOB_[name]/[path]
, $DW_PERSISTENT_[name]/[path]
[name]
is the name of the Rabbit persistent or job storage as specified in the name
argument of the jobdw
or persistentdw
directive. Any '-'
in the name from the jobdw
or persistentdw
directive should be changed to a '_'
in the copy_in
and copy_out
directive. profile
No Profile name This specifies which profile to use when copying data. Profiles specify the copy command to use, MPI arguments, and how output gets logged. If no profile is specified then the default profile is used. Profiles are created by an admin."},{"location":"guides/user-interactions/readme/#examples_4","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=fast-storage\n#DW copy_in source=/lus/backup/johndoe/important_data destination=$DW_JOB_fast_storage/data\n
This set of directives creates an xfs file system on the Rabbits for each compute node in the job, and then moves data from /lus/backup/johndoe/important_data
to each of the xfs file systems. /lus/backup
must be set up in the Rabbit software as a Global Lustre file system by an admin. The copy takes place before the application is launched on the compute nodes.
#DW persistentdw name=shared-data1\n#DW persistentdw name=shared-data2\n\n#DW copy_out source=$DW_PERSISTENT_shared_data1/a destination=$DW_PERSISTENT_shared_data2/a profile=no-xattr\n#DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr\n
This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the no-xattr
profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a deterministic order.
#DW persistentdw name=shared-data\n#DW jobdw type=lustre capacity=1TiB name=fast-storage profile=high-metadata\n\n#DW copy_in source=/lus/shared/johndoe/shared-libraries destination=$DW_JOB_fast_storage/libraries\n#DW copy_in source=$DW_PERSISTENT_shared_data/ destination=$DW_JOB_fast_storage/data\n\n#DW copy_out source=$DW_JOB_fast_storage/data destination=/lus/backup/johndoe/very_important_data profile=no-xattr\n
This set of directives makes use of a persistent storage allocation and a job storage allocation. There are two copy_in
directives, one that copies data from the global lustre file system to the job allocation, and another that copies data from the persistent allocation to the job allocation. These copies do not occur in a deterministic order. The copy_out
directive occurs after the application has exited, and copies data from the Rabbit job storage to a global lustre file system.
"},{"location":"guides/user-interactions/readme/#container","title":"container","text":"The container
directive is used to launch user containers on the Rabbit nodes. The containers have access to jobdw
, persistentdw
, or global Lustre storage as specified in the container
directive. More documentation for user containers can be found in the User Containers guide. Only a single container
directive is allowed in a job.
"},{"location":"guides/user-interactions/readme/#command-arguments_5","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the container instance that is unique within a job profile
Yes Profile name This specifies which container profile to use. The container profile contains information about which container to run, which file system types to expect, which network ports are needed, and many other options. An admin is responsible for creating the container profiles. DW_JOB_[expected]
No jobdw
storage allocation name
The container profile will list jobdw
file systems that the container requires. [expected]
is the name as specified in the container profile DW_PERSISTENT_[expected]
No persistentdw
storage allocation name
The container profile will list persistentdw
file systems that the container requires. [expected]
is the name as specified in the container profile DW_GLOBAL_[expected]
No Global lustre path The container profile will list global Lustre file systems that the container requires. [expected]
is the name as specified in the container profile"},{"location":"guides/user-interactions/readme/#examples_5","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=fast-storage\n#DW container name=backup profile=automatic-backup DW_JOB_source=fast-storage DW_GLOBAL_destination=/lus/backup/johndoe\n
These directives create an xfs Rabbit job allocation and specify a container that should run on the Rabbit nodes. The container profile specified two file systems that the container needs, DW_JOB_source
and DW_GLOBAL_destination
. DW_JOB_source
requires a jobdw
file system and DW_GLOBAL_destination
requires a global Lustre file system.
"},{"location":"guides/user-interactions/readme/#environment-variables","title":"Environment Variables","text":"The WLM makes a set of environment variables available to the job application running on the compute nodes that provide Rabbit specific information. These environment variables are used to find the mount location of Rabbit file systems and port numbers for user containers.
Environment Variable Value Notes DW_JOB_[name]
Mount path of a jobdw
file system [name]
is from the name
argument in the jobdw
directive. Any '-'
characters in the name
will be converted to '_'
in the environment variable. There will be one of these environment variables per jobdw
directive in the job. DW_PERSISTENT_[name]
Mount path of a persistentdw
file system [name]
is from the name
argument in the persistentdw
directive. Any '-'
characters in the name
will be converted to '_'
in the environment variable. There will be one of these environment variables per persistentdw
directive in the job. NNF_CONTAINER_PORTS
Comma separated list of ports These ports are used together with the IP address of the local Rabbit to communicate with a user container specified by a container
directive. More information can be found in the User Containers guide."},{"location":"repo-guides/readme/","title":"Repo Guides","text":""},{"location":"repo-guides/readme/#management","title":"Management","text":" - Releasing NNF Software
"},{"location":"repo-guides/release-nnf-sw/readme/","title":"Releasing NNF Software","text":""},{"location":"repo-guides/release-nnf-sw/readme/#nnf-software-overview","title":"NNF Software Overview","text":"The following repositories comprise the NNF Software and each have their own versions. There is a hierarchy, since nnf-deploy
packages the individual components together using submodules.
Each component under nnf-deploy
needs to be released first, then nnf-deploy
can be updated to point to those release versions, then nnf-deploy
itself can be updated and released.
The documentation repo (NearNodeFlash/NearNodeFlash.github.io) is released separately and is not part of nnf-deploy
, but it should match the version number of nnf-deploy
. Release this like the other components.
-
NearNodeFlash/nnf-deploy
- DataWorkflowServices/dws
- HewlettPackard/lustre-csi-driver
- NearNodeFlash/lustre-fs-operator
- NearNodeFlash/nnf-mfu
- NearNodeFlash/nnf-sos
- NearNodeFlash/nnf-dm
- NearNodeFlash/nnf-integration-test
-
NearNodeFlash/NearNodeFlash.github.io
nnf-ec is vendored in as part of nnf-sos
and does not need to be released separately.
"},{"location":"repo-guides/release-nnf-sw/readme/#primer","title":"Primer","text":"This document is based on the process set forth by the DataWorkflowServices Release Process. Please read that as a background for this document before going any further.
"},{"location":"repo-guides/release-nnf-sw/readme/#requirements","title":"Requirements","text":"To create tags and releases, you will need maintainer or admin rights on the repos.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-each-component-in-nnf-deploy","title":"Release Each Component In nnf-deploy
","text":"You'll first need to create releases for each component contained in nnf-deploy
. This section describes that process.
Each release branch needs to be updated with what is on master. To do that, we'll need the latest copy of master, and it will ultimately be merged to the releases/v0
branch via a Pull Request. Once merged, an annotated tag is created and then a release.
Each component has its own version number that needs to be incremented. Make sure you change the version numbers in the commands below to match the new version for the component. The v0.0.3
is just an example.
-
Ensure your branches are up to date:
git checkout master\ngit pull\ngit checkout releases/v0\ngit pull\n
-
Create a branch to merge into the release branch:
git checkout -b release-v0.0.3\n
-
Merge in the updates from the master
branch. There should not be any conflicts, but it's not unheard of. Tread carefully if there are conflicts.
git merge master\n
-
Verify that there are no differences between your branch and the master branch:
git diff master\n
If there are any differences, they must be trivial. Some READMEs may have extra lines at the end.
-
Perform repo-specific updates:
- For
lustre-csi-driver
, lustre-fs-operator
, dws
, nnf-sos
, and nnf-dm
there are additional files that need to track the version number as well, which allow them to be installed with kubectl apply -k
.
Repo Update nnf-mfu
The new version of nnf-mfu
is referenced by the NNFMFU
variable in several places:nnf-sos
1. Makefile
replace NNFMFU
with nnf-mfu's
tag.nnf-dm
1. In Dockerfile
and Makefile
, replace NNFMFU_VERSION
with the new version.2. In config/manager/kustomization.yaml
, replace nnf-mfu
's newTag: <X.Y.Z>.
nnf-deploy
1. In config/repositories.yaml
replace NNFMFU_VERSION
with the new version. lustre-fs-operator
update config/manager/kustomization.yaml
with the correct version.nnf-deploy
1. In config/repositories.yaml
replace the lustre-fs-operator version. dws
update config/manager/kustomization.yaml
with the correct version. nnf-sos
update config/manager/kustomization.yaml
with the correct version. nnf-dm
update config/manager/kustomization.yaml
with the correct version. lustre-csi-driver
update deploy/kubernetes/base/kustomization.yaml
and charts/lustre-csi-driver/values.yaml
with the correct version.nnf-deploy
1. In config/repositories.yaml
replace the lustre-csi-driver version. -
Target the releases/v0
branch with a Pull Request from your branch. When merging the Pull Request, you must use a Merge Commit.
Note
Do not Rebase or Squash! Those actions remove the records that Git uses to determine which commits have been merged, and then when the next release is created Git will treat everything like a conflict. Additionally, this will cause auto-generated release notes to include the previous release.
-
Once merged, update the release branch locally and create an annotated tag. Each repo has a workflow job named create_release
that will create a release automatically when the new tag is pushed.
git checkout releases/v0\ngit pull\ngit tag -a v0.0.3 -m \"Release v0.0.3\"\ngit push origin --tags\n
-
GOTO Step 1 and repeat this process for each remaining component.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-nnf-deploy","title":"Release nnf-deploy
","text":"Once the individual components are released, we need to update the submodules in nnf-deploy's
master
branch before we create the release branch. This ensures that everything is current on master
for nnf-deploy
.
-
Update the submodules for nnf-deploy
on master:
cd nnf-deploy\ngit checkout master\ngit pull\ngit submodule foreach git checkout master\ngit submodule foreach git pull\n
-
Create a branch to capture the submodule changes for the PR to master
git checkout -b update-submodules\n
-
Commit the changes and open a Pull Request against the master
branch.
-
Once merged, follow steps 1-3 from the previous section to create a release branch off of releases/v0
and update it with changes from master
.
-
There will be conflicts for the submodules after step 3. This is expected. Update the submodules to the new tags and then commit the changes. If each tag was committed properly, the following command can do this for you:
git submodule foreach 'git checkout `git describe --match=\"v*\" HEAD`'\n
-
Add each submodule to the commit with git add
.
-
Verify that each submodule is now at the proper tagged version.
git submodule\n
-
Update config/repositories.yaml
with the referenced versions for:
lustre-csi-driver
lustre-fs-operator
nnf-mfu
(Search for NNFMFU_VERSION)
-
Tidy and make nnf-deploy
to avoid embarrassment.
go mod tidy\nmake\n
-
Do another git add
for any changes, particularly go.mod
and/or go.sum
.
-
Verify that git status
is happy with nnf-deploy
and then finalize the merge from master by with a git commit
.
-
Follow steps 6-7 from the previous section to finalize the release of nnf-deploy
.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-nearnodeflashgithubio","title":"Release NearNodeFlash.github.io
","text":"Please review and update the documentation for changes you may have made.
After nnf-deploy has a release tag, you may release the documentation. Use the same steps found above in \"Release Each Component\". Note that the default branch for this repo is \"main\" instead of \"master\".
Give this release a tag that matches the nnf-deploy release, to show that they go together. Create the release by using the \"Create release\" or \"Draft a new release\" button in the GUI, or by using the gh release create
CLI command. Whether using the GUI or the CLI, mark the release as \"latest\" and select the appropriate option to generate release notes.
Wait for the mike
tool in .github/workflow/release.yaml
to finish building the new doc. You can check its status by going to the gh-pages
branch in the repo. When you visit the release at https://nearnodeflash.github.io, you should see the new release in the drop-down menu and the new release should be the default display.
The software is now released!
"},{"location":"repo-guides/release-nnf-sw/readme/#clone-a-release","title":"Clone a release","text":"The follow commands clone release v0.0.7
into nnf-deploy-v0.0.7
export NNF_VERSION=v0.0.7\n\ngit clone --recurse-submodules git@github.com:NearNodeFlash/nnf-deploy nnf-deploy-$NNF_VERSION\ncd nnf-deploy-$NNF_VERSION\ngit -c advice.detachedHead=false checkout $NNF_VERSION --recurse-submodules\n\ngit submodule status\n
"},{"location":"rfcs/","title":"Request for Comment","text":" -
Rabbit Request For Comment Process - Published
-
Rabbit Storage For Containerized Applications - Published
"},{"location":"rfcs/0001/readme/","title":"Rabbit Request For Comment Process","text":"Rabbit software must be designed in close collaboration with our end-users. Part of this process involves open discussion in the form of Request For Comment (RFC) documents. The remainder of this document presents the RFC process for Rabbit.
"},{"location":"rfcs/0001/readme/#history-philosophy","title":"History & Philosophy","text":"NNF RFC documents are modeled after the long history of IETF RFC documents that describe the internet. The philosophy is captured best in RFC 3
The content of a [...] note may be any thought, suggestion, etc. related to the HOST software or other aspect of the network. Notes are encouraged to be timely rather than polished. Philosophical positions without examples or other specifics, specific suggestions or implementation techniques without introductory or background explication, and explicit questions without any attempted answers are all acceptable. The minimum length for a [...] note is one sentence.
These standards (or lack of them) are stated explicitly for two reasons. First, there is a tendency to view a written statement as ipso facto authoritative, and we hope to promote the exchange and discussion of considerably less than authoritative ideas. Second, there is a natural hesitancy to publish something unpolished, and we hope to ease this inhibition.
"},{"location":"rfcs/0001/readme/#when-to-create-an-rfc","title":"When to Create an RFC","text":"New features, improvements, and other tasks that need to source feedback from multiple sources are to be written as Request For Comment (RFC) documents.
"},{"location":"rfcs/0001/readme/#metadata","title":"Metadata","text":"At the start of each RFC, there must include a short metadata block that contains information useful for filtering and sorting existing documents. This markdown is not visible inside the document.
---\nauthors: John Doe <john.doe@company.com>, Jane Doe <jane.doe@company.com>\nstate: prediscussion|ideation|discussion|published|committed|abandoned\ndiscussion: (link to PR, if available)\n----\n
"},{"location":"rfcs/0001/readme/#creation","title":"Creation","text":"An RFC should be created at the next freely available 4-digit index the GitHub RFC folder. Create a folder for your RFC and write your RFC document as readme.md
using standard Markdown. Include additional documents or images in the folder if needed.
Add an entry to /docs/rfcs/index.md
Add an entry to /mkdocs.yml
in the nav[RFCs]
section
"},{"location":"rfcs/0001/readme/#push","title":"Push","text":"Push your changes to your RFC branch
git add --all\ngit commit -s -m \"[####]: Your Request For Comment Document\"\ngit push origin ####\n
"},{"location":"rfcs/0001/readme/#pull-request","title":"Pull Request","text":"Submit a PR for your branch. This will open your RFC to comments. Add those individuals who are interested in your RFC as reviewers.
"},{"location":"rfcs/0001/readme/#merge","title":"Merge","text":"Once consensus has been reached on your RFC, merge to main origin.
"},{"location":"rfcs/0002/readme/","title":"Rabbit storage for containerized applications","text":"Note
This RFC contains outdated information. For the most up-to-date details, please refer to the User Containers documentation.
For Rabbit to provide storage to a containerized application there needs to be some mechanism. The remainder of this RFC proposes that mechanism.
"},{"location":"rfcs/0002/readme/#actors","title":"Actors","text":"There are several actors involved:
- The AUTHOR of the containerized application
- The ADMINISTRATOR who works with the author to determine the application requirements for execution
- The USER who intends to use the application using the 'container' directive in their job specification
- The RABBIT software that interprets the #DWs and starts the container during execution of the job
There are multiple relationships between the actors:
- AUTHOR to ADMINISTRATOR: The author tells the administrator how their application is executed and the NNF storage requirements.
- Between the AUTHOR and USER: The application expects certain storage, and the #DW must meet those expectations.
- ADMINISTRATOR to RABBIT: Admin tells Rabbit how to run the containerized application with the required storage.
- Between USER and RABBIT: User provides the #DW container directive in the job specification. Rabbit validates and interprets the directive.
"},{"location":"rfcs/0002/readme/#proposal","title":"Proposal","text":"The proposal below outlines the high level behavior of running containers in a workflow:
- The AUTHOR writes their application expecting NNF Storage at specific locations. For each storage requirement, they define:
- a unique name for the storage which can be referenced in the 'container' directive
- the required mount path or mount path prefix
- other constraints or storage requirements (e.g. minimum capacity)
- The AUTHOR works with the ADMINISTRATOR to define:
- a unique name for the program to be referred by USER
- the pod template or MPI Job specification for executing their program
- the NNF storage requirements described above.
- The ADMINISTRATOR creates a corresponding NNF Container Profile Kubernetes custom resource with the necessary NNF storage requirements and pod specification as described by the AUTHOR
- The USER who desires to use the application works with the AUTHOR and the related NNF Container Profile to understand the storage requirements
- The USER submits a WLM job with the #DW container directive variables populated
- WLM runs the workflow and drives it through the following stages...
Proposal
: RABBIT validates the #DW container directive by comparing the supplied values to those listed in the NNF Container Profile. If the workflow fails to meet the requirements, the job fails PreRun
: RABBIT software: - duplicates the pod template specification from the Container Profile and patches the necessary Volumes and the config map. The spec is used as the basis for starting the necessary pods and containers
- creates a config map reflecting the storage requirements and any runtime parameters; this is provided to the container at the volume mount named
nnf-config
, if specified
- The containerized application(s) executes. The expected mounts are available per the requirements and celebration occurs. The pods continue to run until:
- a pod completes successfully (any failed pods will be retried)
- the max number of pod retries is hit (indicating failure on all retry attempts)
- Note: retry limit is non-optional per Kubernetes configuration
- If retries are not desired, this number could be set to 0 to disable any retry attempts
PostRun
: RABBIT software: - marks the stage as
Ready
if the pods have all completed successfully. This includes a successful retry after preceding failures - starts a timer for any running pods. Once the timeout is hit, the pods will be killed and the workflow will indicate failure
- leaves all pods around for log inspection
"},{"location":"rfcs/0002/readme/#container-assignment-to-rabbit-nodes","title":"Container Assignment to Rabbit Nodes","text":"During Proposal
, the USER must assign compute nodes for the container workflow. The assigned compute nodes determine which Rabbit nodes run the containers.
"},{"location":"rfcs/0002/readme/#container-definition","title":"Container Definition","text":"Containers can be launched in two ways:
- MPI Jobs
- Non-MPI Jobs
MPI Jobs are launched using mpi-operator
. This uses a launcher/worker model. The launcher pod is responsible for running the mpirun
command that will target the worker pods to run the MPI application. The launcher will run on the first targeted NNF node and the workers will run on each of the targeted NNF nodes.
For Non-MPI jobs, mpi-operator
is not used. This model runs the same application on each of the targeted NNF nodes.
The NNF Container Profile allows a user to pick one of these methods. Each method is defined in similar, but different fashions. Since MPI Jobs use mpi-operator
, the MPIJobSpec
is used to define the container(s). For Non-MPI Jobs a PodSpec
is used to define the container(s).
An example of an MPI Job is below. The data.mpiSpec
field is defined:
kind: NnfContainerProfile\napiVersion: nnf.cray.hpe.com/v1alpha1\ndata:\n mpiSpec:\n mpiReplicaSpecs:\n Launcher:\n template:\n spec:\n containers:\n - command:\n - mpirun\n - dcmp\n - $(DW_JOB_foo_local_storage)/0\n - $(DW_JOB_foo_local_storage)/1\n image: ghcr.io/nearnodeflash/nnf-mfu:latest\n name: example-mpi\n Worker:\n template:\n spec:\n containers:\n - image: ghcr.io/nearnodeflash/nnf-mfu:latest\n name: example-mpi\n slotsPerWorker: 1\n...\n
An example of a Non-MPI Job is below. The data.spec
field is defined:
kind: NnfContainerProfile\napiVersion: nnf.cray.hpe.com/v1alpha1\ndata:\n spec:\n containers:\n - command:\n - /bin/sh\n - -c\n - while true; do date && sleep 5; done\n image: alpine:latest\n name: example-forever\n...\n
In both cases, the spec
is used as a starting point to define the containers. NNF software supplements the specification to add functionality (e.g. mounting #DW storages). In other words, what you see here will not be the final spec for the container that ends up running as part of the container workflow.
"},{"location":"rfcs/0002/readme/#security","title":"Security","text":"The workflow's UID and GID are used to run the container application and for mounting the specified fileystems in the container. Kubernetes allows for a way to define permissions for a container using a Security Context.
mpirun
uses ssh
to communicate with the worker nodes. ssh
requires that UID is assigned to a username. Since the UID/GID are dynamic values from the workflow, work must be done to the container's /etc/passwd
to map the UID/GID to a username. An InitContainer
is used to modify /etc/passwd
and mount it into the container.
"},{"location":"rfcs/0002/readme/#communication-details","title":"Communication Details","text":"The following subsections outline the proposed communication between the Rabbit nodes themselves and the Compute nodes.
"},{"location":"rfcs/0002/readme/#rabbit-to-rabbit-communication","title":"Rabbit-to-Rabbit Communication","text":""},{"location":"rfcs/0002/readme/#non-mpi-jobs","title":"Non-MPI Jobs","text":"Each rabbit node can be reached via <hostname>.<subdomain>
using DNS. The hostname is the Rabbit node name and the workflow name is used for the subdomain.
For example, a workflow name of foo
that targets rabbit-node2
would be rabbit-node2.foo
.
Environment variables are provided to the container and ConfigMap for each rabbit that is targeted by the container workflow:
NNF_CONTAINER_NODES=rabbit-node2 rabbit-node3\nNNF_CONTAINER_SUBDOMAIN=foo\nNNF_CONTAINER_DOMAIN=default.svc.cluster.local\n
kind: ConfigMap\napiVersion: v1\ndata:\n nnfContainerNodes:\n - rabbit-node2\n - rabbit-node3\n nnfContainerSubdomain: foo\n nnfContainerDomain: default.svc.cluster.local\n
DNS can then be used to communicate with other Rabbit containers. The FQDN for the container running on rabbit-node2 is rabbit-node2.foo.default.svc.cluster.local
.
"},{"location":"rfcs/0002/readme/#mpi-jobs","title":"MPI Jobs","text":"For MPI Jobs, these hostnames and subdomains will be slightly different due to the implementation of mpi-operator
. However, the variables will remain the same and provide a consistent way to retrieve the values.
"},{"location":"rfcs/0002/readme/#compute-to-rabbit-communication","title":"Compute-to-Rabbit Communication","text":"For Compute to Rabbit communication, the proposal is to use an open port between the nodes, so the applications could communicate using IP protocol. The port number would be assigned by the Rabbit software and included in the workflow resource's environmental variables after the Setup state (similar to workflow name & namespace). Flux should provide the port number to the compute application via an environmental variable or command line argument. The containerized application would always see the same port number using the hostPort
/containerPort
mapping functionality included in Kubernetes. To clarify, the Rabbit software is picking and managing the ports picked for hostPort
.
This requires a range of ports to be open in the firewall configuration and specified in the rabbit system configuration. The fewer the number of ports available increases the chances of a port reservation conflict that would fail a workflow.
Example port range definition in the SystemConfiguration:
apiVersion: v1\nitems:\n - apiVersion: dws.cray.hpe.com/v1alpha1\n kind: SystemConfiguration\n name: default\n namespace: default\n spec:\n containerHostPortRangeMin: 30000\n containerHostPortRangeMax: 40000\n ...\n
"},{"location":"rfcs/0002/readme/#example","title":"Example","text":"For this example, let's assume I've authored an application called foo
. This application requires Rabbit local GFS2 storage and a persistent Lustre storage volume.
Working with an administrator, my application's storage requirements and pod specification are placed in an NNF Container Profile foo
:
kind: NnfContainerProfile\napiVersion: v1alpha1\nmetadata:\n name: foo\n namespace: default\nspec:\n postRunTimeout: 300\n maxRetries: 6\n storages:\n - name: DW_JOB_foo-local-storage\n optional: false\n - name: DW_PERSISTENT_foo-persistent-storage\n optional: false\n spec:\n containers:\n - name: foo\n image: foo:latest\n command:\n - /foo\n ports:\n - name: compute\n containerPort: 80\n
Say Peter wants to use foo
as part of his job specification. Peter would submit the job with the directives below:
#DW jobdw name=my-gfs2 type=gfs2 capacity=1TB\n\n#DW persistentdw name=some-lustre\n\n#DW container name=my-foo profile=foo \\\n DW_JOB_foo-local-storage=my-gfs2 \\\n DW_PERSISTENT_foo-persistent-storage=some-lustre\n
Since the NNF Container Profile has specified that both storages are not optional (i.e. optional: false
), they must both be present in the #DW directives along with the container
directive. Alternatively, if either was marked as optional (i.e. optional: true
), it would not be required to be present in the #DW directives and therefore would not be mounted into the container.
Peter submits the job to the WLM. WLM guides the job through the workflow states:
- Proposal: Rabbit software verifies the #DW directives. For the container directive
my-foo
with profile foo
, the storage requirements listed in the NNF Container Profile are foo-local-storage
and foo-persistent-storage
. These values are correctly represented by the directive so it is valid. - Setup: Since there is a jobdw,
my-gfs2
, Rabbit software provisions this storage. -
Pre-Run:
-
Rabbit software generates a config map that corresponds to the storage requirements and runtime parameters.
kind: ConfigMap\n apiVersion: v1\n metadata:\n name: my-job-container-my-foo\n data:\n DW_JOB_foo_local_storage: mount-type=indexed-mount\n DW_PERSISTENT_foo_persistent_storage: mount-type=mount-point\n ...\n
-
Rabbit software creates a pod and duplicates the foo
pod spec in the NNF Container Profile and fills in the necessary volumes and config map.
kind: Pod\n apiVersion: v1\n metadata:\n name: my-job-container-my-foo\n template:\n metadata:\n name: foo\n namespace: default\n spec:\n containers:\n # This section unchanged from Container Profile\n - name: foo\n image: foo:latest\n command:\n - /foo\n volumeMounts:\n - name: foo-local-storage\n mountPath: <MOUNT_PATH>\n - name: foo-persistent-storage\n mountPath: <MOUNT_PATH>\n - name: nnf-config\n mountPath: /nnf/config\n ports:\n - name: compute\n hostPort: 9376 # hostport selected by Rabbit software\n containerPort: 80\n\n # volumes added by Rabbit software\n volumes:\n - name: foo-local-storage\n hostPath:\n path: /nnf/job/my-job/my-gfs2\n - name: foo-persistent-storage\n hostPath:\n path: /nnf/persistent/some-lustre\n - name: nnf-config\n configMap:\n name: my-job-container-my-foo\n\n # securityContext added by Rabbit software - values will be inherited from the workflow\n securityContext:\n runAsUser: 1000\n runAsGroup: 2000\n fsGroup: 2000\n
-
Rabbit software starts the pods on Rabbit nodes
- Post-Run
- Rabbit waits for all pods to finish (or until timeout is hit)
- If all pods are successful, Post-Run is marked as
Ready
- If any pod is not successful, Post-Run is not marked as
Ready
"},{"location":"rfcs/0002/readme/#special-note-indexed-mount-type-for-gfs2-file-systems","title":"Special Note: Indexed-Mount Type for GFS2 File Systems","text":"When using a GFS2 file system, each compute is allocated its own Rabbit volume. The Rabbit software mounts a collection of mount paths with a common prefix and an ending indexed value.
Application AUTHORS must be aware that their desired mount-point really contains a collection of directories, one for each compute node. The mount point type can be known by consulting the config map values.
If we continue the example from above, the foo
application expects the foo-local-storage path of /foo/local
to contain several directories
$ ls /foo/local/*\n\nnode-0\nnode-1\nnode-2\n...\nnode-N\n
Node positions are not absolute locations. WLM could, in theory, select 6 physical compute nodes at physical location 1, 2, 3, 5, 8, 13, which would appear as directories /node-0
through /node-5
in the container path.
Symlinks will be added to support the physical compute node names. Assuming a compute node hostname of compute-node-1
from the example above, it would link to node-0
, compute-node-2
would link to node-1
, etc.
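An illustrative listing of such a mount directory (a sketch; the exact layout may differ) would show both the indexed directories and the hostname symlinks:
$ ls -l /foo/local/   # abbreviated output\ncompute-node-1 -> node-0\ncompute-node-2 -> node-1\ncompute-node-3 -> node-2\n...\n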
Additionally, not all container instances may see the same number of compute nodes in an indexed-mount scenario. For example, if 17 compute nodes are required for the job, the WLM may assign 16 compute nodes to one Rabbit and 1 compute node to another.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-,:!=\\[\\]()\"/]+|(?!\\b)(?=[A-Z][a-z])|\\.(?!\\d)|&[lg]t;","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Near Node Flash","text":"Near Node Flash, also known as Rabbit, provides a disaggregated chassis-local storage solution which utilizes SR-IOV over a PCIe Gen 4.0 switching fabric to provide a set of compute blades with NVMe storage. It also provides a dedicated storage processor to offload tasks such as storage preparation and data movement from the compute nodes.
Here you will find NNF User Guides, Examples, and Request For Comment (RFC) documents.
"},{"location":"guides/","title":"User Guides","text":""},{"location":"guides/#setup","title":"Setup","text":" - Initial Setup
- Compute Daemons
- Firmware Upgrade
- High Availability Cluster
- RBAC for Users
"},{"location":"guides/#provisioning","title":"Provisioning","text":" - Storage Profiles
- Data Movement Configuration
- Copy Offload API
- Lustre External MGT
- Global Lustre
- Directive Breakdown
- User Interactions
"},{"location":"guides/#nnf-user-containers","title":"NNF User Containers","text":" - User Containers
"},{"location":"guides/#node-management","title":"Node Management","text":" - Disable or Drain a Node
- Debugging NVMe Namespaces
"},{"location":"guides/compute-daemons/readme/","title":"Compute Daemons","text":"Rabbit software requires two daemons be installed and run on each compute node. Each daemon shares similar build, package, and installation processes described below.
- The Client Mount daemon,
clientmount
, provides the support for mounting Rabbit hosted file systems on compute nodes. - The Data Movement daemon,
nnf-dm
, supports creating, monitoring, and managing data movement (copy-offload) operations
"},{"location":"guides/compute-daemons/readme/#building-from-source","title":"Building from source","text":"Each daemon can be built in their respective repositories using the build-daemon
make target. Go version >= 1.19 must be installed to perform a local build.
"},{"location":"guides/compute-daemons/readme/#rpm-package","title":"RPM Package","text":"Each daemon is packaged as part of the build process in GitHub. Source and Binary RPMs are available.
"},{"location":"guides/compute-daemons/readme/#installation","title":"Installation","text":"For manual install, place the binary in the /usr/bin/
directory.
To install the application as a daemon service, run /usr/bin/[BINARY-NAME] install
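For example, installing and starting the nnf-dm daemon might look like the following (a sketch, assuming a systemd-based compute image):
/usr/bin/nnf-dm install\nsystemctl daemon-reload\nsystemctl enable --now nnf-dm\n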
"},{"location":"guides/compute-daemons/readme/#authentication","title":"Authentication","text":"NNF software defines a Kubernetes Service Account for granting communication privileges between the daemon and the kubeapi server. The token file and certificate file can be obtained by providing the necessary Service Account and Namespace to the below shell script.
Compute Daemon Service Account Namespace Client Mount nnf-clientmount nnf-system Data Movement nnf-dm-daemon nnf-dm-system #!/bin/bash\n\nSERVICE_ACCOUNT=$1\nNAMESPACE=$2\n\nkubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data.token' | base64 --decode > ./service.token\nkubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data[\"ca.crt\"]' | base64 --decode > ./service.cert\n
The service.token
and service.cert
files must be copied to each compute node, typically in the /etc/[BINARY-NAME]/
directory.
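For example, distributing the files for the nnf-dm daemon (a sketch; the compute hostname is hypothetical):
ssh compute-01 mkdir -p /etc/nnf-dm\nscp service.token service.cert compute-01:/etc/nnf-dm/\n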
"},{"location":"guides/compute-daemons/readme/#configuration","title":"Configuration","text":"Installing the daemon will create a default configuration located at /etc/systemd/system/[BINARY-NAME].service
The command line arguments can be provided to the service definition or as an override file.
Argument Definition --kubernetes-service-host=[ADDRESS]
The IP address or DNS entry of the kubeapi server --kubernetes-service-port=[PORT]
The listening port of the kubeapi server --service-token-file=[PATH]
Location of the service token file --service-cert-file=[PATH]
Location of the service certificate file --node-name=[COMPUTE-NODE-NAME]
Name of this compute node as described in the System Configuration. Defaults to the host name reported by the OS. --nnf-node-name=[RABBIT-NODE-NAME]
nnf-dm
daemon only. Name of the rabbit node connected to this compute node as described in the System Configuration. If not provided, the --node-name
value is used to find the associated Rabbit node in the System Configuration. --sys-config=[NAME]
nnf-dm
daemon only. The System Configuration resource's name. Defaults to default
An example unit file for nnf-dm:
cat /etc/systemd/system/nnf-dm.service[Unit]\nDescription=Near-Node Flash (NNF) Data Movement Service\n\n[Service]\nPIDFile=/var/run/nnf-dm.pid\nExecStartPre=/bin/rm -f /var/run/nnf-dm.pid\nExecStart=/usr/bin/nnf-dm \\\n --kubernetes-service-host=127.0.0.1 \\\n --kubernetes-service-port=7777 \\\n --service-token-file=/path/to/service.token \\\n --service-cert-file=/path/to/service.cert \\\n --kubernetes-qps=50 \\\n --kubernetes-burst=100\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\n
An example unit file for clientmountd:
cat /etc/systemd/system/clientmountd.service[Unit]\nDescription=Near-Node Flash (NNF) Clientmountd Service\n\n[Service]\nPIDFile=/var/run/clientmountd.pid\nExecStartPre=/bin/rm -f /var/run/clientmountd.pid\nExecStart=/usr/bin/clientmountd \\\n --kubernetes-service-host=127.0.0.1 \\\n --kubernetes-service-port=7777 \\\n --service-token-file=/path/to/service.token \\\n --service-cert-file=/path/to/service.cert\nRestart=on-failure\nEnvironment=GOGC=off\nEnvironment=GOMEMLIMIT=20MiB\nEnvironment=GOMAXPROCS=5\nEnvironment=HTTP2_PING_TIMEOUT_SECONDS=60\n\n[Install]\nWantedBy=multi-user.target\n
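The arguments can also be supplied through a standard systemd drop-in override instead of editing the unit file directly. A sketch for nnf-dm, with placeholder values:
systemctl edit nnf-dm\n\n# Contents of the generated override, e.g. /etc/systemd/system/nnf-dm.service.d/override.conf:\n[Service]\nExecStart=\nExecStart=/usr/bin/nnf-dm \\\n    --kubernetes-service-host=192.168.0.1 \\\n    --kubernetes-service-port=6443 \\\n    --service-token-file=/etc/nnf-dm/service.token \\\n    --service-cert-file=/etc/nnf-dm/service.cert\n\nsystemctl daemon-reload\nsystemctl restart nnf-dm\n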
"},{"location":"guides/compute-daemons/readme/#nnf-dm-specific-configuration","title":"nnf-dm Specific Configuration","text":"nnf-dm has some additional configuration options that can be used to tweak the kubernetes client:
Argument Definition --kubernetes-qps=[QPS]
The number of Queries Per Second (QPS) before client-side rate-limiting starts. Defaults to 50. --kubernetes-burst=[QPS]
Once QPS is hit, allow this many concurrent calls. Defaults to 100."},{"location":"guides/compute-daemons/readme/#easy-deployment","title":"Easy Deployment","text":"The nnf-deploy tool's install
command can be used to run the daemons on a system's set of compute nodes. This option will compile the latest daemon binaries, retrieve the service token and certificates, and will copy and install the daemons on each of the compute nodes. Refer to the nnf-deploy repository and run nnf-deploy install --help
for details.
"},{"location":"guides/data-movement/readme/","title":"Data Movement Configuration","text":"Data Movement can be configured in multiple ways:
- Server side
- Per Copy Offload API Request arguments
The first method is a \"global\" configuration - it affects all data movement operations. The second is done per the Copy Offload API, which allows for some configuration on a per-case basis, but is limited in scope. Both methods are meant to work in tandem.
"},{"location":"guides/data-movement/readme/#server-side-configmap","title":"Server Side ConfigMap","text":"The server side configuration is done via the nnf-dm-config
config map:
kubectl -n nnf-dm-system get configmap nnf-dm-config\n
The config map allows you to configure the following:
Setting Description slots The number of slots specified in the MPI hostfile. A value less than 1 disables the use of slots in the hostfile. maxSlots The number of max_slots specified in the MPI hostfile. A value less than 1 disables the use of max_slots in the hostfile. command The full command to execute data movement. More detail in the following section. progressIntervalSeconds interval to collect the progress data from the dcp
command."},{"location":"guides/data-movement/readme/#command","title":"command
","text":"The full data movement command
can be set here. By default, Data Movement uses mpirun
to run dcp
to perform the data movement. Changing the command
is useful for tweaking mpirun
or dcp
options or to replace the command with something that can aid in debugging (e.g. hostname
).
mpirun
uses hostfiles to list the hosts to launch dcp
on. This hostfile is created for each Data Movement operation, and it uses the config map to set the slots
and maxSlots
for each host (i.e. NNF node) in the hostfile. The number of slots
/maxSlots
is the same for every host in the hostfile.
Additionally, Data Movement uses substitution to fill in dynamic information for each Data Movement operation. Each of these must be present in the command for Data Movement to work properly when using mpirun
and dcp
:
VAR Description $HOSTFILE
hostfile that is created and used for mpirun. $UID
User ID that is inherited from the Workflow. $GID
Group ID that is inherited from the Workflow. $SRC
source for the data movement. $DEST
destination for the data movement. By default, the command will look something like the following. Please see the config map itself for the most up to date default command:
mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST\n
"},{"location":"guides/data-movement/readme/#profiles","title":"Profiles","text":"Profiles can be specified in the in the nnf-dm-config
config map. Users are able to select a profile using #DW directives (e.g .copy_in profile=my-dm-profile
) and the Copy Offload API. If no profile is specified, the default
profile is used. This default profile must exist in the config map.
slots
, maxSlots
, and command
can be stored in Data Movement profiles. These profiles are available to quickly switch between different settings for a particular workflow.
Example profiles:
profiles:\n default:\n slots: 8\n maxSlots: 0\n command: mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --uid $UID --gid $GID $SRC $DEST\n no-xattrs:\n slots: 8\n maxSlots: 0\n command: mpirun --allow-run-as-root --hostfile $HOSTFILE dcp --progress 1 --xattrs none --uid $UID --gid $GID $SRC $DEST\n
"},{"location":"guides/data-movement/readme/#copy-offload-api-daemon","title":"Copy Offload API Daemon","text":"The CreateRequest
API call that is used to create Data Movement with the Copy Offload API has some options to allow a user to specify some options for that particular Data Movement. These settings are on a per-request basis.
The Copy Offload API requires the nnf-dm
daemon to be running on the compute node. This daemon may be configured to run full-time, or it may be left in a disabled state if the WLM is expected to run it only when a user requests it. See Compute Daemons for the systemd service configuration of the daemon. See RequiredDaemons
in Directive Breakdown for a description of how the user may request the daemon, in the case where the WLM will run it only on demand.
If the WLM is running the nnf-dm
daemon only on demand, then the user can request that the daemon be running for their job by specifying requires=copy-offload
in their DW
directive. The following is an example:
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload\n
See the DataMovementCreateRequest API definition for what can be configured.
"},{"location":"guides/data-movement/readme/#selinux-and-data-movement","title":"SELinux and Data Movement","text":"Careful consideration must be taken when enabling SELinux on compute nodes. Doing so will result in SELinux Extended File Attributes (xattrs) being placed on files created by applications running on the compute node, which may not be supported by the destination file system (e.g. Lustre).
Depending on the configuration of dcp
, there may be an attempt to copy these xattrs. You may need to disable this by using dcp --xattrs none
to avoid errors. For example, the command
in the nnf-dm-config
config map or dcpOptions
in the DataMovementCreateRequest API could be used to set this option.
See the dcp
documentation for more information.
"},{"location":"guides/directive-breakdown/readme/","title":"Directive Breakdown","text":""},{"location":"guides/directive-breakdown/readme/#background","title":"Background","text":"The #DW
directives in a job script are not intended to be interpreted by the workload manager. The workload manager passes the #DW
directives to the NNF software through the DWS workflow
resource, and the NNF software determines what resources are needed to satisfy the directives. The NNF software communicates this information back to the workload manager through the DWS DirectiveBreakdown
resource. This document describes how the WLM should interpret the information in the DirectiveBreakdown
.
"},{"location":"guides/directive-breakdown/readme/#directivebreakdown-overview","title":"DirectiveBreakdown Overview","text":"The DWS DirectiveBreakdown
contains all the information necessary to inform the WLM how to pick storage and compute nodes for a job. The DirectiveBreakdown
resource is created by the NNF software during the Proposal
phase of the DWS workflow. The spec
section of the DirectiveBreakdown
is filled in with the #DW
directive by the NNF software, and the status
section contains the information for the WLM. The WLM should wait until the status.ready
field is true before interpreting the rest of the status
fields.
The contents of the DirectiveBreakdown
will look different depending on the file system type and options specified by the user. The status
section contains enough information that the WLM may be able to figure out the underlying file system type requested by the user, but the WLM should not make any decisions based on the file system type. Instead, the WLM should make storage and compute allocation decisions based on the generic information provided in the DirectiveBreakdown
since the storage and compute allocations needed to satisfy a #DW
directive may differ based on options other than the file system type.
"},{"location":"guides/directive-breakdown/readme/#storage-nodes","title":"Storage Nodes","text":"The status.storage
section of the DirectiveBreakdown
describes how the storage allocations should be made and any constraints on the NNF nodes that can be picked. The status.storage
section will exist only for jobdw
and create_persistent
directives. An example of the status.storage
section is included below.
...\nspec:\n directive: '#DW jobdw capacity=1GiB type=xfs name=example'\n userID: 7900\nstatus:\n...\n ready: true\n storage:\n allocationSets:\n - allocationStrategy: AllocatePerCompute\n constraints:\n labels:\n - dataworkflowservices.github.io/storage=Rabbit\n label: xfs\n minimumCapacity: 1073741824\n lifetime: job\n reference:\n kind: Servers\n name: example-0\n namespace: default\n...\n
-
status.storage.allocationSets
is a list of storage allocation sets that are needed for the job. An allocation set is a group of individual storage allocations that all have the same parameters and requirements. Depending on the storage type specified by the user, there may be more than one allocation set. Allocation sets should be handled independently.
-
status.storage.allocationSets.allocationStrategy
specifies how the allocations should be made.
AllocatePerCompute
- One allocation is needed per compute node in the job. The size of an individual allocation is specified in status.storage.allocationSets.minimumCapacity
AllocateAcrossServers
- One or more allocations are needed with an aggregate capacity of status.storage.allocationSets.minimumCapacity
. This allocation strategy does not imply anything about how many allocations to make per NNF node or how many NNF nodes to use. The allocations on each NNF node should be the same size. AllocateSingleServer
- One allocation is needed with a capacity of status.storage.allocationSets.minimumCapacity
-
status.storage.allocationSets.constraints
is a set of requirements for which NNF nodes can be picked. More information about the different constraint types is provided in the Storage Constraints section below.
-
status.storage.allocationSets.label
is an opaque string that the WLM uses when creating the spec.allocationSets entry in the DWS Servers
resource.
-
status.storage.allocationSets.minimumCapacity
is the allocation capacity in bytes. The interpretation of this field depends on the value of status.storage.allocationSets.allocationStrategy
-
status.storage.lifetime
is used to specify how long the storage allocations will last.
job
- The allocation will last for the lifetime of the job persistent
- The allocation will last for longer than the lifetime of the job
-
status.storage.reference
is an object reference to a DWS Servers
resource where the WLM can specify allocations
"},{"location":"guides/directive-breakdown/readme/#storage-constraints","title":"Storage Constraints","text":"Constraints on an allocation set provide additional requirements for how the storage allocations should be made on NNF nodes.
-
labels
specifies a list of labels that must all be on a DWS Storage
resource in order for an allocation to exist on that Storage
.
constraints:\n labels:\n - dataworkflowservices.github.io/storage=Rabbit\n - mysite.org/pool=firmware_test\n
apiVersion: dataworkflowservices.github.io/v1alpha2\nkind: Storage\nmetadata:\n labels:\n dataworkflowservices.github.io/storage: Rabbit\n mysite.org/pool: firmware_test\n mysite.org/drive-speed: fast\n name: rabbit-node-1\n namespace: default\n ...\n
-
colocation
specifies how two or more allocations influence the location of each other. The colocation constraint has two fields, type
and key
. Currently, the only value for type
is exclusive
. key
can be any value. This constraint means that the allocations from an allocation set with the colocation constraint can't be placed on an NNF node with another allocation whose allocation set has a colocation constraint with the same key. Allocations from allocation sets with colocation constraints with different keys or allocation sets without the colocation constraint are okay to put on the same NNF node.
constraints:\n colocation:\n type: exclusive\n key: lustre-mgt\n
-
count
this field specifies the number of allocations to make when status.storage.allocationSets.allocationStrategy
is AllocateAcrossServers
constraints:\n count: 5\n
-
scale
is a unitless value from 1-10 that is meant to guide the WLM on how many allocations to make when status.storage.allocationSets.allocationStrategy
is AllocateAcrossServers
. The actual number of allocations is not meant to correspond to the value of scale. Rather, 1 would indicate the minimum number of allocations to reach status.storage.allocationSets.minimumCapacity
, and 10 would be the maximum number of allocations that make sense given the status.storage.allocationSets.minimumCapacity
and the compute node count. The NNF software does not interpret this value, and it is up to the WLM to define its meaning.
constraints:\n scale: 8\n
"},{"location":"guides/directive-breakdown/readme/#compute-nodes","title":"Compute Nodes","text":"The status.compute
section of the DirectiveBreakdown
describes how the WLM should pick compute nodes for a job. The status.compute
section will exist only for jobdw
and persistentdw
directives. An example of the status.compute
section is included below.
...\nspec:\n directive: '#DW jobdw capacity=1TiB type=lustre name=example'\n userID: 3450\nstatus:\n...\n compute:\n constraints:\n location:\n - access:\n - priority: mandatory\n type: network\n - priority: bestEffort\n type: physical\n reference:\n fieldPath: servers.spec.allocationSets[0]\n kind: Servers\n name: example-0\n namespace: default\n - access:\n - priority: mandatory\n type: network\n reference:\n fieldPath: servers.spec.allocationSets[1]\n kind: Servers\n name: example-0\n namespace: default\n...\n
The status.compute.constraints
section lists any constraints on which compute nodes can be used. Currently the only constraint type is the location
constraint. status.compute.constraints.location
is a list of location constraints that all must be satisfied.
A location constraint consists of an access
list and a reference
.
status.compute.constraints.location.reference
is an object reference with a fieldPath
that points to an allocation set in the Servers
resource. If this is from a #DW jobdw
directive, the Servers
resource won't be filled in until the WLM picks storage nodes for the allocations. status.compute.constraints.location.access
is a list that specifies what type of access the compute nodes need to have to the storage allocations in the allocation set. An allocation set may have multiple access types that are required status.compute.constraints.location.access.type
specifies the connection type for the storage. This can be network
or physical
status.compute.constraints.location.access.priority
specifies how necessary the connection type is. This can be mandatory
or bestEffort
"},{"location":"guides/directive-breakdown/readme/#requireddaemons","title":"RequiredDaemons","text":"The status.requiredDaemons
section of the DirectiveBreakdown
tells the WLM about any driver-specific daemons it must enable for the job; it is assumed that the WLM knows about the driver-specific daemons and that if the users are specifying these then the WLM knows how to start them. The status.requiredDaemons
section will exist only for jobdw
and persistentdw
directives. An example of the status.requiredDaemons
section is included below.
status:\n...\n requiredDaemons:\n - copy-offload\n...\n
The allowed list of required daemons that may be specified is defined in the nnf-ruleset.yaml for DWS, found in the nnf-sos
repository. The ruleDefs.key[requires]
statement is specified in two places in the ruleset, one for jobdw
and the second for persistentdw
. The ruleset allows a list of patterns to be specified, allowing one for each of the allowed daemons.
The DW
directive will include a comma-separated list of daemons after the requires
keyword. The following is an example:
#DW jobdw type=xfs capacity=1GB name=stg1 requires=copy-offload\n
The DWDirectiveRule
resource currently active on the system can be viewed with:
kubectl get -n dws-system dwdirectiverule nnf -o yaml\n
"},{"location":"guides/directive-breakdown/readme/#valid-daemons","title":"Valid Daemons","text":"Each site should define the list of daemons that are valid for that site and recognized by that site's WLM. The initial nnf-ruleset.yaml
defines only one, called copy-offload
. When a user specifies copy-offload
in their DW
directive, they are stating that their compute-node application will use the Copy Offload API Daemon described in the Data Movement Configuration.
"},{"location":"guides/external-mgs/readme/","title":"Lustre External MGT","text":""},{"location":"guides/external-mgs/readme/#background","title":"Background","text":"Lustre has a limitation where only a single MGT can be mounted on a node at a time. In some situations it may be desirable to share an MGT between multiple Lustre file systems to increase the number of Lustre file systems that can be created and to decrease scheduling complexity. This guide provides instructions on how to configure NNF to share MGTs. There are three methods that can be used:
- Use a Lustre MGT from outside the NNF cluster
- Create a persistent Lustre file system through DWS and use the MGT it provides
- Create a pool of standalone persistent Lustre MGTs, and have the NNF software select one of them
These three methods are not mutually exclusive on the system as a whole. Individual file systems can use any of options 1-3 or create their own MGT.
"},{"location":"guides/external-mgs/readme/#configuration-with-an-external-mgt","title":"Configuration with an External MGT","text":""},{"location":"guides/external-mgs/readme/#storage-profile","title":"Storage Profile","text":"An existing MGT external to the NNF cluster can be used to manage the Lustre file systems on the NNF nodes. An advantage to this configuration is that the MGT can be highly available through multiple MGSs. A disadvantage is that there is only a single MGT. An MGT shared between more than a handful of Lustre file systems is not a common use case, so the Lustre code may prove less stable.
The following yaml provides an example of what the NnfStorageProfile
should contain to use an MGT on an external server.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: external-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: 1.2.3.4@eth0:1.2.3.5@eth0\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
"},{"location":"guides/external-mgs/readme/#nnflustremgt","title":"NnfLustreMGT","text":"A NnfLustreMGT
resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a NnfLustreMGT
resource. Every MGT must have a corresponding NnfLustreMGT
resource. For MGTs that are hosted on NNF hardware, the NnfLustreMGT
resources are created automatically. The NNF software also erases any no longer used fsnames from disk for any internally hosted MGTs. For an MGT hosted on an external node, an admin must create an NnfLustreMGT
. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the lctl erase_lcfg [fsname]
command to remove fsnames that are no longer in use.
Below is an example NnfLustreMGT
resource. The NnfLustreMGT
resource for external MGSs should be created in the nnf-system
namespace.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfLustreMGT\nmetadata:\n name: external-mgt\n namespace: nnf-system\nspec:\n addresses:\n - \"1.2.3.4@eth0:1.2.3.5@eth0\"\n fsNameStart: \"aaaaaaaa\"\n fsNameBlackList:\n - \"mylustre\"\n fsNameStartReference:\n name: external-mgt\n namespace: default\n kind: ConfigMap\n
addresses
- This is a list of LNet addresses that could be used for this MGT. This should match any values that are used in the externalMgs
field in the NnfStorageProfiles
. fsNameStart
- The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g, aaaaaaaa
, aaaaaaab
, aaaaaaac
). fsnames use lowercase letters 'a'
-'z'
. fsNameBlackList
- This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist. fsNameStartReference
- This is an optional ObjectReference to a ConfigMap
that holds a starting fsname. If this field is specified, it takes precedence over the fsNameStart
field in the spec. The ConfigMap
will be updated to the next available fsname everytime an fsname is assigned to a new Lustre file system.
"},{"location":"guides/external-mgs/readme/#configmap","title":"ConfigMap","text":"For external MGTs, the fsNameStartReference
should be used to point to a ConfigMap
in the default namespace. The ConfigMap
should not be removed during an argocd undeploy/deploy. This allows the nnf-sos sofware to be undeployed (including any NnfLustreMGT
resources), without having the fsname reset back to the fsNameStart
value on a redeploy. The Configmap that is created should be left empty initially.
"},{"location":"guides/external-mgs/readme/#argocd","title":"Argocd","text":" - An empty ConfigMap should be deployed with the
0-early-config
application. - The argocd application for
0-early-config
should be updated to include the following under ignoreDifferences
: - kind: ConfigMap\n jsonPointers:\n - /data\n
- A yaml file for the
NnfLustreMGT
resource should be deployed with the 2-nnf-sos
application. It should be created in the nnf-system
namespace, and it can have any name. The ConfigMap
should be listed in the fsNameStartReference
field. - The argocd application for
2-nnf-sos
should be updated to include the following under ignoreDifferences
: - group: nnf.cray.hpe.com\n kind: NnfLustreMGT\n jsonPointers:\n - /spec/claimList\n
A separate ConfigMap
and NnfLustreMGT
is needed for every external Lustre MGT.
"},{"location":"guides/external-mgs/readme/#configuration-with-persistent-lustre","title":"Configuration with Persistent Lustre","text":"The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster.
To configure a persistent Lustre file system that can share its MGT, a NnfStorageProfile
should be used that does not specify externalMgs
. The MGT can either share a volume with the MDT or not (combinedMgtMdt
).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: persistent-lustre-shared-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
The persistent storage is created with the following DW directive:
#DW create_persistent name=shared-lustre capacity=100GiB type=lustre profile=persistent-lustre-shared-mgt\n
After the persistent Lustre file system is created, an admin can discover the MGS address by looking at the NnfStorage
resource with the same name as the persistent storage that was created (shared-lustre
in the above example).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorage\nmetadata:\n name: shared-lustre\n namespace: default\n[...]\nstatus:\n mgsNode: 5.6.7.8@eth1\n[...]\n
A separate NnfStorageProfile
can be created that specifies the MGS address.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: internal-mgt\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: 5.6.7.8@eth1\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
With this configuration, an admin must determine that no file systems are using the shared MGT before destroying the persistent Lustre instance.
"},{"location":"guides/external-mgs/readme/#configuration-with-an-internal-mgt-pool","title":"Configuration with an Internal MGT Pool","text":"Another method NNF supports is to create a number of persistent Lustre MGTs on NNF nodes. These MGTs are not part of a full file system, but are instead added to a pool of MGTs available for other Lustre file systems to use. Lustre file systems that are created will choose one of the MGTs at random to use and add a reference to make sure it isn't destroyed. This configuration has the advantage of spreading the Lustre management load across multiple servers. The disadvantage of this configuration is that it does not provide high availability.
To configure the system this way, the first step is to make a pool of Lustre MGTs. This is done by creating a persistent instance from a storage profile that specifies the standaloneMgtPoolName
option. This option tells NNF software to only create an MGT, and to add it to a named pool. The following NnfStorageProfile
provides an example where the MGT is added to the example-pool
pool:
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: mgt-pool-member\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"example-pool\"\n[...]\n
A persistent storage MGTs can be created with the following DW directive:
#DW create_persistent name=mgt-pool-member-1 capacity=1GiB type=lustre profile=mgt-pool-member\n
Multiple persistent instances with different names can be created using the mgt-pool-member
profile to add more than one MGT to the pool.
To create a Lustre file system that uses one of the MGTs from the pool, an NnfStorageProfile
should be created that uses the special notation pool:[pool-name]
in the externalMgs
field.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: mgt-pool-consumer\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n externalMgs: \"pool:example-pool\"\n combinedMgtMdt: false\n standaloneMgtPoolName: \"\"\n[...]\n
The following provides an example DW directive that uses an MGT from the MGT pool:
#DW jobdw name=example-lustre capacity=100GiB type=lustre profile=mgt-pool-consumer\n
MGT pools are named, so there can be separate pools with collections of different MGTs in them. A storage profile targeting each pool would be needed.
"},{"location":"guides/firmware-upgrade/readme/","title":"Firmware Upgrade Procedures","text":"This guide presents the firmware upgrade procedures to upgrade firmware from the Rabbit using tools present in the operating system.
"},{"location":"guides/firmware-upgrade/readme/#pcie-switch-firmware-upgrade","title":"PCIe Switch Firmware Upgrade","text":"In order to upgrade the firmware on the PCIe switch, the switchtec
kernel driver and utility of the same name must be installed. Rabbit hardware consists of two PCIe switches, which can be managed by devices typically located at /dev/switchtec0
and /dev/switchtec1
.
Danger
Upgrading the switch firmware will cause the switch to reset. Prototype Rabbit units not supporting hotplug should undergo a power-cycle to ensure switch initialization following firmware uprade. Similarily, compute nodes not supporting hotplug may lose connectivity after firmware upgrade and should also be power-cycled.
IMAGE=$1 # Provide the path to the firmware image file\nSWITCHES=(\"/dev/switchtec0\" \"/dev/switchtec1\")\nfor SWITCH in \"${SWITCHES[@]}\"; do switchtec fw-update \"$SWITCH\" \"$IMAGE\" --yes; done\n
"},{"location":"guides/firmware-upgrade/readme/#nvme-drive-firmware-upgrade","title":"NVMe Drive Firmware Upgrade","text":"In order to upgrade the firmware on NVMe drives attached to Rabbit, the switchtec
and switchtec-nvme
executables must be installed. All firmware downloads to drives are sent to the physical function of the drive which is accessible only using the switchtec-nvme
executable.
"},{"location":"guides/firmware-upgrade/readme/#batch-method","title":"Batch Method","text":""},{"location":"guides/firmware-upgrade/readme/#download-and-commit-new-firmware","title":"Download and Commit New Firmware","text":"The nvme.sh helper script applies the same command to each physical device fabric ID in the system. It provides a convenient way to upgrade the firmware on all drives in the system. Please see fw-download and fw-commit for details about the individual commands.
# Download firmware to all drives\n./nvme.sh cmd fw-download --fw=</path/to/nvme.fw>\n\n# Commit the new firmware\n# action=3: The image is requested to be activated immediately\n./nvme.sh cmd fw-commit --action=3\n
"},{"location":"guides/firmware-upgrade/readme/#rebind-the-pcie-connections","title":"Rebind the PCIe Connections","text":"In order to use the drives at this point, they must be unbound and bound to the PCIe fabric to reset device connections. The bind.sh helper script performs these two actions. Its use is illustrated below.
# Unbind all drives from the Rabbit to disconnect the PCIe connection to the drives\n./bind.sh unbind\n\n# Bind all drives to the Rabbit to reconnect the PCIe bus\n./bind.sh bind\n\n# At this point, your drives should be running the new firmware.\n# Verify the firmware...\n./nvme.sh cmd id-ctrl | grep -E \"^fr \"\n
"},{"location":"guides/firmware-upgrade/readme/#individual-drive-method","title":"Individual Drive Method","text":""},{"location":"guides/firmware-upgrade/readme/#determine-physical-device-fabric-id","title":"Determine Physical Device Fabric ID","text":"The first step is to determine a drive's unique Physical Device Fabric Identifier (PDFID). The following code fragment demonstrates one way to list the physcial device fabric ids of all the NVMe drives in the system.
#!/bin/bash\n\nSWITCHES=(\"/dev/switchtec0\" \"/dev/switchtec1\")\nfor SWITCH in \"${SWITCHES[@]}\";\ndo\n mapfile -t PDFIDS < <(sudo switchtec fabric gfms-dump \"${SWITCH}\" | grep \"Function 0 \" -A1 | grep PDFID | awk '{print $2}')\n for INDEX in \"${!PDFIDS[@]}\";\n do\n echo \"${PDFIDS[$INDEX]}@$SWITCH\"\n done\ndone\n
# Produces a list like this:\n0x1300@/dev/switchtec0\n0x1600@/dev/switchtec0\n0x1700@/dev/switchtec0\n0x1400@/dev/switchtec0\n0x1800@/dev/switchtec0\n0x1900@/dev/switchtec0\n0x1500@/dev/switchtec0\n0x1a00@/dev/switchtec0\n0x4100@/dev/switchtec1\n0x3c00@/dev/switchtec1\n0x4000@/dev/switchtec1\n0x3e00@/dev/switchtec1\n0x4200@/dev/switchtec1\n0x3b00@/dev/switchtec1\n0x3d00@/dev/switchtec1\n0x3f00@/dev/switchtec1\n
"},{"location":"guides/firmware-upgrade/readme/#download-firmware","title":"Download Firmware","text":"Using the physical device fabric identifier, the following commands update the firmware for specified drive.
# Download firmware to the drive\nsudo switchtec-nvme fw-download <PhysicalDeviceFabricID> --fw=</path/to/nvme.fw>\n\n# Activate the new firmware\n# action=3: The image is requested to be activated immediately without reset.\nsudo switchtec-nvme fw-commit --action=3\n
"},{"location":"guides/firmware-upgrade/readme/#rebind-pcie-connection","title":"Rebind PCIe Connection","text":"Once the firmware has been downloaded and committed, the PCIe connection from the Rabbit to the drive must be unbound and rebound. Please see bind.sh for details.
"},{"location":"guides/global-lustre/readme/","title":"Global Lustre","text":""},{"location":"guides/global-lustre/readme/#background","title":"Background","text":"Adding global lustre to rabbit systems allows access to external file systems. This is primarily used for Data Movement, where a user can perform copy_in
and copy_out
directives with global lustre being the source and destination, respectively.
Global lustre fileystems are represented by the lustrefilesystems
resource in Kubernetes:
$ kubectl get lustrefilesystems -A\nNAMESPACE NAME FSNAME MGSNIDS AGE\ndefault mylustre mylustre 10.1.1.113@tcp 20d\n
An example resource is as follows:
apiVersion: lus.cray.hpe.com/v1beta1\nkind: LustreFileSystem\nmetadata:\n name: mylustre\n namespace: default\nspec:\n mgsNids: 10.1.1.100@tcp\n mountRoot: /p/mylustre\n name: mylustre\n namespaces:\n default:\n modes:\n - ReadWriteMany\n
"},{"location":"guides/global-lustre/readme/#namespaces","title":"Namespaces","text":"Note the spec.namespaces
field. For each namespace listed, the lustre-fs-operator
creates a PV/PVC pair in that namespace. This allows pods in that namespace to access global lustre. The default
namespace should appear in this list. This makes the lustrefilesystem
resource available to the default
namespace, which makes it available to containers (e.g. container workflows) running in the default
namespace.
The nnf-dm-system
namespace is added automatically - no need to specify that manually here. The NNF Data Movement Manager is responsible for ensuring that the nnf-dm-system
is in spec.namespaces
. This is to ensure that the NNF DM Worker pods have global lustre mounted as long as nnf-dm
is deployed. To unmount global lustre from the NNF DM Worker pods, the lustrefilesystem
resource must be deleted.
The lustrefilesystem
resource itself should be created in the default
namespace (i.e. metadata.namespace
).
"},{"location":"guides/global-lustre/readme/#nnf-data-movement-manager","title":"NNF Data Movement Manager","text":"The NNF Data Movement Manager is responsible for monitoring lustrefilesystem
resources to mount (or umount) the global lustre filesystem in each of the NNF DM Worker pods. These pods run on each of the NNF nodes. This means with each addition or removal of lustrefilesystems
resources, the DM worker pods restart to adjust their mount points.
The NNF Data Movement Manager also places a finalizer on the lustrefilesystem
resource to indicate that the resource is in use by Data Movement. This is to prevent the PV/PVC being deleted while they are being used by pods.
"},{"location":"guides/global-lustre/readme/#adding-global-lustre","title":"Adding Global Lustre","text":"As mentioned previously, the NNF Data Movement Manager monitors these resources and automatically adds the nnf-dm-system
namespace to all lustrefilesystem
resources. Once this happens, a PV/PVC is created for the nnf-dm-system
namespace to access global lustre. The Manager updates the NNF DM Worker pods, which are then restarted to mount the global lustre file system.
"},{"location":"guides/global-lustre/readme/#removing-global-lustre","title":"Removing Global Lustre","text":"When a lustrefilesystem
is deleted, the NNF DM Manager takes notice and starts to unmount the file system from the DM Worker pods - causing another restart of the DM Worker pods. Once this is finished, the DM finalizer is removed from the lustrefilesystem
resource to signal that it is no longer in use by Data Movement.
If a lustrefilesystem
does not delete, check the finalizers to see what might still be using it. It is possible to get into a situation where nnf-dm
has been undeployed, so there is nothing to remove the DM finalizer from the lustrefilesystem
resource. If that is the case, then manually remove the DM finalizer so the deletion of the lustrefilesystem
resource can continue.
"},{"location":"guides/ha-cluster/notes/","title":"Notes","text":"pcs stonith create stonith-rabbit-node-1 fence_nnf pcmk_host_list=rabbit-node-1 kubernetes-service-host=10.30.107.247 kubernetes-service-port=6443 service-token-file=/etc/nnf/service.token service-cert-file=/etc/nnf/service.cert nnf-node-name=rabbit-node-1 verbose=1
pcs stonith create stonith-rabbit-compute-2 fence_redfish pcmk_host_list=\"rabbit-compute-2\" ip=10.30.105.237 port=80 systems-uri=/redfish/v1/Systems/1 username=root password=REDACTED ssl_insecure=true verbose=1
pcs stonith create stonith-rabbit-compute-3 fence_redfish pcmk_host_list=\"rabbit-compute-3\" ip=10.30.105.253 port=80 systems-uri=/redfish/v1/Systems/1 username=root password=REDACTED ssl_insecure=true verbose=1
"},{"location":"guides/ha-cluster/readme/","title":"High Availability Cluster","text":"NNF software supports provisioning of Red Hat GFS2 (Global File System 2) storage. Per RedHat:
GFS2 allows multiple nodes to share storage at a block level as if the storage were connected locally to each cluster node. GFS2 cluster file system requires a cluster infrastructure.
Therefore, in order to use GFS2, the NNF node and its associated compute nodes must form a high availability cluster.
"},{"location":"guides/ha-cluster/readme/#cluster-setup","title":"Cluster Setup","text":"Red Hat provides instructions for creating a high availability cluster with Pacemaker, including instructions for installing cluster software and creating a high availability cluster. When following these instructions, each of the high availability clusters that are created should be named after the hostname of the NNF node. In the Red Hat examples the cluster name is my_cluster
.
"},{"location":"guides/ha-cluster/readme/#fencing-agents","title":"Fencing Agents","text":"Fencing is the process of restricting and releasing access to resources that a failed cluster node may have access to. Since a failed node may be unresponsive, an external device must exist that can restrict access to shared resources of that node, or to issue a hard reboot of the node. More information can be found form Red Hat: 1.2.1 Fencing.
HPE hardware implements software known as the Hardware System Supervisor (HSS), which itself conforms to the SNIA Redfish/Swordfish standard. This provides the means to manage hardware outside the host OS.
"},{"location":"guides/ha-cluster/readme/#nnf-fencing","title":"NNF Fencing","text":""},{"location":"guides/ha-cluster/readme/#source","title":"Source","text":"The NNF Fencing agent is available at https://github.com/NearNodeFlash/fence-agents under the nnf
branch.
git clone https://github.com/NearNodeFlash/fence-agents --branch nnf\n
"},{"location":"guides/ha-cluster/readme/#build","title":"Build","text":"Refer to the NNF.md file
at the root directory of the fence-agents repository. The fencing agents must be installed on every node in the cluster.
"},{"location":"guides/ha-cluster/readme/#setup","title":"Setup","text":"Configure the NNF agent with the following parameters:
Argument Definition kubernetes-service-host=[ADDRESS]
The IP address of the kubeapi server kubernetes-service-port=[PORT]
The listening port of the kubeapi server service-token-file=[PATH]
The location of the service token file. The file must be present on all nodes within the cluster service-cert-file=[PATH]
The location of the service certificate file. The file must be present on all nodes within the cluster nnf-node-name=[NNF-NODE-NAME]
Name of the NNF node as it is appears in the System Configuration api-version=[VERSION]
The API Version of the NNF Node resource. Defaults to \"v1alpha1\" The token and certificate can be found in the Kubernetes Secrets resource for the nnf-system/nnf-fencing-agent ServiceAccount. This provides RBAC rules to limit the fencing agent to only the Kubernetes resources it needs access to.
For example, setting up the NNF fencing agent on rabbit-node-1
with a kubernetes service API running at 192.168.0.1:6443
and the service token and certificate copied to /etc/nnf/fence/
. This needs to be run on one node in the cluster.
pcs stonith create rabbit-node-1 fence_nnf pcmk_host_list=rabbit-node-1 kubernetes-service-host=192.168.0.1 kubernetes-service-port=6443 service-token-file=/etc/nnf/fence/service.token service-cert-file=/etc/nnf/fence/service.cert nnf-node-name=rabbit-node-1\n
"},{"location":"guides/ha-cluster/readme/#recovery","title":"Recovery","text":"Since the NNF node is connected to 16 compute blades, careful coordination around fencing of a NNF node is required to minimize the impact of the outage. When a Rabbit node is fenced, the corresponding DWS Storage resource (storages.dws.cray.hpe.com
) status changes. The workload manager must observe this change and follow the procedure below to recover from the fencing status.
- Observed the
storage.Status
changed and that storage.Status.RequiresReboot == True
- Set the
storage.Spec.State := Disabled
- Wait for a change to the Storage status
storage.Status.State == Disabled
- Reboot the NNF node
- Set the
storage.Spec.State := Enabled
- Wait for
storage.Status.State == Enabled
"},{"location":"guides/ha-cluster/readme/#compute-fencing","title":"Compute Fencing","text":"The Redfish fencing agent from ClusterLabs should be used for Compute nodes in the cluster. It is also included at https://github.com/NearNodeFlash/fence-agents, and can be built at the same time as the NNF fencing agent. Configure the agent with the following parameters:
Argument Definition ip=[ADDRESS]
The IP address or hostname of the HSS controller port=80
The Port of the HSS controller. Must be 80
systems-uri=/redfish/v1/Systems/1
The URI of the Systems object. Must be /redfish/v1/Systems/1
ssl-insecure=true
Instructs the use of an insecure SSL exchange. Must be true
username=[USER]
The user name for connecting to the HSS controller password=[PASSWORD]
the password for connecting to the HSS controller For example, setting up the Redfish fencing agent on rabbit-compute-2
with the redfish service at 192.168.0.1
. This needs to be run on one node in the cluster.
pcs stonith create rabbit-compute-2 fence_redfish pcmk_host_list=rabbit-compute-2 ip=192.168.0.1 systems-uri=/redfish/v1/Systems/1 username=root password=password ssl_insecure=true\n
"},{"location":"guides/ha-cluster/readme/#dummy-fencing","title":"Dummy Fencing","text":"The dummy fencing agent from ClusterLabs can be used for nodes in the cluster for an early access development system.
"},{"location":"guides/ha-cluster/readme/#configuring-a-gfs2-file-system-in-a-cluster","title":"Configuring a GFS2 file system in a cluster","text":"Follow steps 1-8 of the procedure from Red Hat: Configuring a GFS2 file system in a cluster.
"},{"location":"guides/initial-setup/readme/","title":"Initial Setup Instructions","text":"Instructions for the initial setup of a Rabbit are included in this document.
"},{"location":"guides/initial-setup/readme/#lvm-configuration-on-rabbit","title":"LVM Configuration on Rabbit","text":"LVM Details Running LVM commands (lvcreate/lvremove) on a Rabbit to create logical volumes is problematic if those commands run within a container. Rabbit Storage Orchestration code contained in the nnf-node-manager
Kubernetes pod executes LVM commands from within the container. The problem is that the LVM create/remove commands wait for a UDEV confirmation cookie that is set when UDEV rules run within the host OS. These cookies are not synchronized with the containers where the LVM commands execute.
3 options to solve this problem are:
- Disable UDEV sync at the host operating system level
- Disable UDEV sync using the
\u2013noudevsync
command option for each LVM command - Clear the UDEV cookie using the
dmsetup udevcomplete_all
command after the lvcreate/lvremove command.
Taking these in reverse order using option 3 above which allows UDEV settings within the host OS to remain unchanged from the default, one would need to start the dmsetup
command on a separate thread because the LVM create/remove command waits for the UDEV cookie. This opens too many error paths, so it was rejected.
Option 2 allows UDEV settings within the host OS to remain unchanged from the default, but the use of UDEV within production Rabbit systems is viewed as unnecessary because the host OS is PXE-booted onto the node vs loaded from an device that is discovered by UDEV.
Option 1 above is what we chose to implement because it is the simplest. The following sections discuss this setting.
In order for LVM commands to run within the container environment on a Rabbit, the following change is required to the /etc/lvm/lvm.conf
file on Rabbit.
sed -i 's/udev_sync = 1/udev_sync = 0/g' /etc/lvm/lvm.conf\n
"},{"location":"guides/initial-setup/readme/#zfs","title":"ZFS","text":"ZFS kernel module must be enabled to run on boot. This can be done by creating a file, zfs.conf
, containing the string \"zfs\" in your systems modules-load.d directory.
echo \"zfs\" > /etc/modules-load.d/zfs.conf\n
"},{"location":"guides/initial-setup/readme/#kubernetes-initial-setup","title":"Kubernetes Initial Setup","text":"Installation of Kubernetes (k8s) nodes proceeds by installing k8s components onto the master node(s) of the cluster, then installing k8s components onto the worker nodes and joining those workers to the cluster. The k8s cluster setup for Rabbit requires 3 distinct k8s node types for operation:
- Master: 1 or more master nodes which serve as the Kubernetes API server and control access to the system. For HA, at least 3 nodes should be dedicated to this role.
- Worker: 1 or more worker nodes which run the system level controller manager (SLCM) and Data Workflow Services (DWS) pods. In production, at least 3 nodes should be dedicated to this role.
- Rabbit: 1 or more Rabbit nodes which run the node level controller manager (NLCM) code. The NLCM daemonset pods are exclusively scheduled on Rabbit nodes. All Rabbit nodes are joined to the cluster as k8s workers, and they are tainted to restrict the type of work that may be scheduled on them. The NLCM pod has a toleration that allows it to run on the tainted (i.e. Rabbit) nodes.
"},{"location":"guides/initial-setup/readme/#kubernetes-node-labels","title":"Kubernetes Node Labels","text":"Node Type Node Label Generic Kubernetes Worker Node cray.nnf.manager=true Rabbit Node cray.nnf.node=true"},{"location":"guides/initial-setup/readme/#kubernetes-node-taints","title":"Kubernetes Node Taints","text":"Node Type Node Label Rabbit Node cray.nnf.node=true:NoSchedule See Taints and Tolerations. The SystemConfiguration controller will handle node taints and labels for the rabbit nodes based on the contents of the SystemConfiguration resource described below.
"},{"location":"guides/initial-setup/readme/#rabbit-system-configuration","title":"Rabbit System Configuration","text":"The SystemConfiguration Custom Resource Definition (CRD) is a DWS resource that describes the hardware layout of the whole system. It is expected that an administrator creates a single SystemConfiguration resource when the system is being set up. There is no need to update the SystemConfiguration resource unless hardware is added to or removed from the system.
System Configuration Details Rabbit software looks for a SystemConfiguration named default
in the default
namespace. This resource contains a list of compute nodes and storage nodes, and it describes the mapping between them. There are two different consumers of the SystemConfiguration resource in the NNF software:
NnfNodeReconciler
- The reconciler for the NnfNode resource running on the Rabbit nodes reads the SystemConfiguration resource. It uses the Storage to compute mapping information to fill in the HostName section of the NnfNode resource. This information is then used to populate the DWS Storage resource.
NnfSystemConfigurationReconciler
- This reconciler runs in the nnf-controller-manager
. It creates a Namespace for each compute node listed in the SystemConfiguration. These namespaces are used by the client mount code.
Here is an example SystemConfiguration
:
Spec Section Notes computeNodes List of names of compute nodes in the system storageNodes List of Rabbits and the compute nodes attached storageNodes[].type Must be \"Rabbit\" storageNodes[].computeAccess List of {slot, compute name} elements that indicate physical slot index that the named compute node is attached to apiVersion: dataworkflowservices.github.io/v1alpha2\nkind: SystemConfiguration\nmetadata:\n name: default\n namespace: default\nspec:\n computeNodes:\n - name: compute-01\n - name: compute-02\n - name: compute-03\n - name: compute-04\n ports:\n - 5000-5999\n portsCooldownInSeconds: 0\n storageNodes:\n - computesAccess:\n - index: 0\n name: compute-01\n - index: 1\n name: compute-02\n - index: 6\n name: compute-03\n name: rabbit-name-01\n type: Rabbit\n - computesAccess:\n - index: 4\n name: compute-04\n name: rabbit-name-02\n type: Rabbit\n
"},{"location":"guides/node-management/drain/","title":"Disable Or Drain A Node","text":""},{"location":"guides/node-management/drain/#disabling-a-node","title":"Disabling a node","text":"A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM.
Disable a node by setting its Storage state to Disabled
.
kubectl patch storage $NODE --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Disabled\"}]'\n
When the Storage is queried by the WLM, it will show the disabled status.
$ kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Ready Live 10m\nkind-worker3 Disabled Disabled Live 10m\n
To re-enable a node, set its Storage state to Enabled
.
kubectl patch storage $NODE --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Enabled\"}]'\n
The Storage state will show that it is enabled.
kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Ready Live 10m\nkind-worker3 Enabled Ready Live 10m\n
"},{"location":"guides/node-management/drain/#draining-a-node","title":"Draining a node","text":"The NNF software consists of a collection of DaemonSets and Deployments. The pods on the Rabbit nodes are usually from DaemonSets. Because of this, the kubectl drain
command is not able to remove the NNF software from a node. See Safely Drain a Node for details about the limitations posed by DaemonSet pods.
Given the limitations of DaemonSets, the NNF software will be drained by using taints, as described in Taints and Tolerations.
This would be used only after the WLM jobs have been removed from that Rabbit (preferably) and there is some reason to also remove the NNF software from it. This might be used before a Rabbit is powered off and pulled out of the cabinet, for example, to avoid leaving pods in \"Terminating\" state (harmless, but it's noise).
Applying this taint before power-off means no \"Terminating\" pods are left behind for that Rabbit. After a new (or the same) Rabbit is put back in its place, the NNF software will not return to it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the replacement Rabbit is powered back on.
"},{"location":"guides/node-management/drain/#drain-nnf-pods-from-a-rabbit-node","title":"Drain NNF pods from a rabbit node","text":"Drain the NNF software from a node by applying the cray.nnf.node.drain
taint. The CSI driver pods will remain on the node to satisfy any unmount requests from k8s as it cleans up the NNF pods.
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute\n
This will cause the node's Storage
resource to be drained:
$ kubectl get storages\nNAME STATE STATUS MODE AGE\nkind-worker2 Enabled Drained Live 5m44s\nkind-worker3 Enabled Ready Live 5m45s\n
The Storage
resource will contain the following message indicating the reason it has been drained:
$ kubectl get storages rabbit1 -o json | jq -rM .status.message\nKubernetes node is tainted with cray.nnf.node.drain\n
To restore the node to service, remove the cray.nnf.node.drain
taint.
kubectl taint node $NODE cray.nnf.node.drain-\n
The Storage
resource will revert to a Ready
status.
"},{"location":"guides/node-management/drain/#the-csi-driver","title":"The CSI driver","text":"While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so.
Warning K8s relies on the CSI driver to unmount any filesystems that may have been mounted into a pod's namespace. If it is not present when k8s is attempting to remove a pod then the pod may be left in \"Terminating\" state. This is most obvious when draining the nnf-dm-worker
pods which usually have filesystems mounted in them.
Drain the CSI driver pod from a node by applying the cray.nnf.node.drain.csi
taint.
kubectl taint node $NODE cray.nnf.node.drain.csi=true:NoSchedule cray.nnf.node.drain.csi=true:NoExecute\n
To restore the CSI driver pods to that node, remove the cray.nnf.node.drain.csi
taint.
kubectl taint node $NODE cray.nnf.node.drain.csi-\n
This taint will also drain the remaining NNF software if it has not already been drained by the cray.nnf.node.drain
taint.
"},{"location":"guides/node-management/nvme-namespaces/","title":"Debugging NVMe Namespaces","text":""},{"location":"guides/node-management/nvme-namespaces/#total-space-available-or-used","title":"Total Space Available or Used","text":"Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the nnf-node-manager
pod on that node.
To view the space on node ee50, find its nnf-node-manager
pod and then exec into it to query the Redfish API:
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager\nnnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>\n
Then query the Redfish API to view the AllocatedBytes
and GuaranteedBytes
:
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq\n{\n \"@odata.id\": \"/redfish/v1/StorageServices/NNF/CapacitySource\",\n \"@odata.type\": \"#CapacitySource.v1_0_0.CapacitySource\",\n \"Id\": \"0\",\n \"Name\": \"Capacity Source\",\n \"ProvidedCapacity\": {\n \"Data\": {\n \"AllocatedBytes\": 128849888,\n \"ConsumedBytes\": 128849888,\n \"GuaranteedBytes\": 307132496928,\n \"ProvisionedBytes\": 307261342816\n },\n \"Metadata\": {},\n \"Snapshot\": {}\n },\n \"ProvidedClassOfService\": {},\n \"ProvidingDrives\": {},\n \"ProvidingPools\": {},\n \"ProvidingVolumes\": {},\n \"Actions\": {},\n \"ProvidingMemory\": {},\n \"ProvidingMemoryChunks\": {}\n}\n
"},{"location":"guides/node-management/nvme-namespaces/#total-orphaned-or-leaked-space","title":"Total Orphaned or Leaked Space","text":"To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations then there should be no NnfNodeBlockStorages
in the k8s namespace with the Rabbit's name:
[richerso@ee1:~]$ kubectl get nnfnodeblockstorage -n ee50\nNo resources found in ee50 namespace.\n
To check that there are no orphaned namespaces, you can use the nvme command while logged into that Rabbit node:
[root@ee50:~]# nvme list\nNode SN Model Namespace Usage Format FW Rev\n--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------\n/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q\n
There should be no namespaces on the kioxia drives:
[root@ee50:~]# nvme list | grep -i kioxia\n[root@ee50:~]#\n
If there are namespaces listed, and there weren't any NnfNodeBlockStorages
on the node, then they need to be deleted through the Rabbit software. The NnfNodeECData
resource is a persistent data store for the allocations that should exist on the Rabbit. Deleting it, and then deleting the nnf-node-manager pod, causes nnf-node-manager to delete the orphaned namespaces. This can take a few minutes after the pod is deleted:
kubectl delete nnfnodeecdata ec-data -n ee50\nkubectl delete pod -n nnf-system nnf-node-manager-jhglm\n
"},{"location":"guides/rbac-for-users/readme/","title":"RBAC: Role-Based Access Control","text":"RBAC (Role Based Access Control) determines the operations a user or service can perform on a list of Kubernetes resources. RBAC affects everything that interacts with the kube-apiserver (both users and services internal or external to the cluster). More information about RBAC can be found in the Kubernetes documentation.
"},{"location":"guides/rbac-for-users/readme/#rbac-for-users","title":"RBAC for Users","text":"This section shows how to create a kubeconfig file with RBAC set up to restrict access to view only for resources.
"},{"location":"guides/rbac-for-users/readme/#overview","title":"Overview","text":"User access to a Kubernetes cluster is defined through a kubeconfig file. This file contains the address of the kube-apiserver as well as the key and certificate for the user. Typically this file is located in ~/.kube/config
. When a Kubernetes cluster is created, a config file is generated for the admin that allows unrestricted access to all resources in the cluster. This is the equivalent of root
on a Linux system.
The goal of this document is to create a new kubeconfig file that allows view-only access to Kubernetes resources. This kubeconfig file can be shared among HPE employees to investigate issues on the system. This involves:
- Generating a new key/cert pair for an \"hpe\" user
- Creating a new kubeconfig file
- Adding RBAC rules for the \"hpe\" user to allow read access
"},{"location":"guides/rbac-for-users/readme/#generate-a-key-and-certificate","title":"Generate a Key and Certificate","text":"The first step is to create a new key and certificate so that HPE employees can authenticate as the \"hpe\" user. This will likely be done on one of the master nodes. The openssl
command needs access to the certificate authority file. This is typically located in /etc/kubernetes/pki
.
# make a temporary work space\nmkdir /tmp/rabbit\ncd /tmp/rabbit\n\n# Create this user\nexport USERNAME=hpe\n\n# generate a new key\nopenssl genrsa -out rabbit.key 2048\n\n# create a certificate signing request for this user\nopenssl req -new -key rabbit.key -out rabbit.csr -subj \"/CN=$USERNAME\"\n\n# generate a certificate using the certificate authority on the k8s cluster. This certificate lasts 500 days\nopenssl x509 -req -in rabbit.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out rabbit.crt -days 500\n
"},{"location":"guides/rbac-for-users/readme/#create-a-kubeconfig","title":"Create a kubeconfig","text":"After the keys have been generated, a new kubeconfig file can be created for this user. The admin kubeconfig /etc/kubernetes/admin.conf
can be used to determine the cluster name and kube-apiserver address.
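For example, these values can be pulled from the admin kubeconfig with kubectl config view (an illustrative sketch that assumes a single cluster entry in admin.conf):
# Determine the cluster name and kube-apiserver address from the admin kubeconfig\nexport KUBECONFIG=/etc/kubernetes/admin.conf\nCLUSTER_NAME=$(kubectl config view -o jsonpath='{.clusters[0].name}')\nSERVER_ADDRESS=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')\nunset KUBECONFIG\n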
# create a new kubeconfig with the server information\nkubectl config set-cluster $CLUSTER_NAME --kubeconfig=/tmp/rabbit/rabbit.conf --server=$SERVER_ADDRESS --certificate-authority=/etc/kubernetes/pki/ca.crt --embed-certs=true\n\n# add the key and cert for this user to the config\nkubectl config set-credentials $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --client-certificate=/tmp/rabbit/rabbit.crt --client-key=/tmp/rabbit/rabbit.key --embed-certs=true\n\n# add a context\nkubectl config set-context $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --cluster=$CLUSTER_NAME --user=$USERNAME\n
The kubeconfig file should be placed in a location where HPE employees have read access to it.
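One possibility is to copy it to a shared, read-only location (an illustrative sketch; the destination path is hypothetical and site-specific):
# Make the view-only kubeconfig available to HPE employees\ninstall -m 0444 /tmp/rabbit/rabbit.conf /shared/kubeconfigs/hpe-viewer.conf\n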
"},{"location":"guides/rbac-for-users/readme/#create-clusterrole-and-clusterrolebinding","title":"Create ClusterRole and ClusterRoleBinding","text":"The next step is to create ClusterRole and ClusterRoleBinding resources. The ClusterRole provided allows viewing all cluster and namespace scoped resources, but disallows creating, deleting, or modifying any resources.
ClusterRole
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRole\nmetadata:\n name: hpe-viewer\nrules:\n - apiGroups: [ \"*\" ]\n resources: [ \"*\" ]\n verbs: [ get, list ]\n
ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: hpe-viewer\nsubjects:\n- kind: User\n name: hpe\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: hpe-viewer\n apiGroup: rbac.authorization.k8s.io\n
Both of these resources can be created using the kubectl apply
command.
"},{"location":"guides/rbac-for-users/readme/#testing","title":"Testing","text":"Get, List, Create, Delete, and Modify operations can be tested as the \"hpe\" user by setting the KUBECONFIG environment variable to use the new kubeconfig file. Get and List should be the only allowed operations. Other operations should fail with a \"forbidden\" error.
export KUBECONFIG=/tmp/rabbit/rabbit.conf\n
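For example, read operations should succeed while mutating operations are rejected (a sketch; the pod and node names are taken from the examples above and will differ on a real system):
# Allowed: get/list operations\nkubectl get storages\nkubectl get pods -A\n\n# Rejected with a \"forbidden\" error: create, modify, and delete operations\nkubectl delete pod -n nnf-system nnf-node-manager-jhglm\nkubectl patch storage kind-worker2 --type=json -p '[{\"op\":\"replace\", \"path\":\"/spec/state\", \"value\": \"Disabled\"}]'\n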
"},{"location":"guides/rbac-for-users/readme/#rbac-for-workload-manager-wlm","title":"RBAC for Workload Manager (WLM)","text":"Note This section assumes the reader has read and understood the steps described above for setting up RBAC for Users
.
A workload manager (WLM) such as Flux or Slurm will interact with DataWorkflowServices as a privileged user. RBAC is used to limit the operations that a WLM can perform on a Rabbit system.
The following steps are required to create a user and a role for the WLM. In this case, we're creating a user to be used with the Flux WLM:
- Generate a new key/cert pair for a \"flux\" user
- Creating a new kubeconfig file
- Adding RBAC rules for the \"flux\" user to allow appropriate access to the DataWorkflowServices API.
"},{"location":"guides/rbac-for-users/readme/#generate-a-key-and-certificate_1","title":"Generate a Key and Certificate","text":"Generate a key and certificate for our \"flux\" user, similar to the way we created one for the \"hpe\" user above. Substitute \"flux\" in place of \"hpe\".
"},{"location":"guides/rbac-for-users/readme/#create-a-kubeconfig_1","title":"Create a kubeconfig","text":"After the keys have been generated, a new kubeconfig file can be created for the \"flux\" user, similar to the one for the \"hpe\" user above. Again, substitute \"flux\" in place of \"hpe\".
"},{"location":"guides/rbac-for-users/readme/#use-the-provided-clusterrole-and-create-a-clusterrolebinding","title":"Use the provided ClusterRole and create a ClusterRoleBinding","text":"DataWorkflowServices has already defined the role to be used with WLMs, named dws-workload-manager
:
kubectl get clusterrole dws-workload-manager\n
If the \"flux\" user requires only the normal WLM permissions, then create and apply a ClusterRoleBinding to associate the \"flux\" user with the dws-workload-manager
ClusterRole.
The dws-workload-manager role is defined in workload_manager_role.yaml.
ClusterRoleBinding for WLM permissions only:
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: flux\nsubjects:\n- kind: User\n name: flux\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: dws-workload-manager\n apiGroup: rbac.authorization.k8s.io\n
If the \"flux\" user requires the normal WLM permissions as well as some of the NNF permissions, perhaps to collect some NNF resources for debugging, then create and apply a ClusterRoleBinding to associate the \"flux\" user with the nnf-workload-manager
ClusterRole.
The nnf-workload-manager
role is defined in workload_manager_nnf_role.yaml.
ClusterRoleBinding for WLM and NNF permissions:
apiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n name: flux\nsubjects:\n- kind: User\n name: flux\n apiGroup: rbac.authorization.k8s.io\nroleRef:\n kind: ClusterRole\n name: nnf-workload-manager\n apiGroup: rbac.authorization.k8s.io\n
The WLM should then use the kubeconfig file associated with this \"flux\" user to access the DataWorkflowServices API and the Rabbit system.
"},{"location":"guides/storage-profiles/readme/","title":"Storage Profile Overview","text":"Storage Profiles allow for customization of the Rabbit storage provisioning process. Examples of content that can be customized via storage profiles is
- The RAID type used for storage
- Any mkfs or LVM args used
- An external MGS NID for Lustre
- A boolean value indicating whether the Lustre MGT and MDT should be combined on the same target device
DW directives that allocate storage on Rabbit nodes allow a profile
parameter to be specified to control how the storage is configured. NNF software provides a set of canned profiles to choose from, and the administrator may create more profiles.
The administrator shall choose one profile to be the default profile that is used when a profile parameter is not specified.
"},{"location":"guides/storage-profiles/readme/#specifying-a-profile","title":"Specifying a Profile","text":"To specify a profile name on a #DW directive, use the profile
option
#DW jobdw type=lustre profile=durable capacity=5GB name=example\n
"},{"location":"guides/storage-profiles/readme/#setting-a-default-profile","title":"Setting A Default Profile","text":"A default profile must be defined at all times. Any #DW line that does not specify a profile will use the default profile. If a default profile is not defined, then any new workflows will be rejected. If more than one profile is marked as default then any new workflows will be rejected.
To query existing profiles
$ kubectl get nnfstorageprofiles -A\nNAMESPACE NAME DEFAULT AGE\nnnf-system durable true 14s\nnnf-system performance false 6s\n
To set the default flag on a profile
$ kubectl patch nnfstorageprofile performance -n nnf-system --type merge -p '{\"data\":{\"default\":true}}'\n
To clear the default flag on a profile
$ kubectl patch nnfstorageprofile durable -n nnf-system --type merge -p '{\"data\":{\"default\":false}}'\n
"},{"location":"guides/storage-profiles/readme/#creating-the-initial-default-profile","title":"Creating The Initial Default Profile","text":"Create the initial default profile from scratch or by using the NnfStorageProfile/template resource as a template. If nnf-deploy
was used to install nnf-sos then the default profile described below will have been created automatically.
To use the template
resource begin by obtaining a copy of it either from the nnf-sos repo or from a live system. To get it from a live system use the following command:
kubectl get nnfstorageprofile -n nnf-system template -o yaml > profile.yaml\n
Edit the profile.yaml
file to trim the metadata section to contain only a name and namespace. The namespace must be left as nnf-system, but the name should be set to signify that this is the new default profile. In this example we will name it default
. The metadata section will look like the following, and will contain no other fields:
metadata:\n name: default\n namespace: nnf-system\n
Mark this new profile as the default profile by setting default: true
in the data section of the resource:
data:\n default: true\n
Apply this resource to the system and verify that it is the only one marked as the default resource:
kubectl get nnfstorageprofile -A\n
The output will appear similar to the following:
NAMESPACE NAME DEFAULT AGE\nnnf-system default true 9s\nnnf-system template false 11s\n
The administrator should edit the default
profile to record any cluster-specific settings. Maintain a copy of this resource YAML in a safe place so it isn't lost across upgrades.
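For example, a copy can be captured with a command like the following (illustrative):
kubectl get nnfstorageprofile -n nnf-system default -o yaml > default-profile-backup.yaml\n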
"},{"location":"guides/storage-profiles/readme/#keeping-the-default-profile-updated","title":"Keeping The Default Profile Updated","text":"An upgrade of nnf-sos may include updates to the template
profile. It may be necessary to manually copy these updates into the default
profile.
"},{"location":"guides/storage-profiles/readme/#profile-parameters","title":"Profile Parameters","text":""},{"location":"guides/storage-profiles/readme/#xfs","title":"XFS","text":"The following shows how to specify command line options for pvcreate, vgcreate, lvcreate, and mkfs for XFS storage. Optional mount options are specified one per line
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: xfs-stripe-example\n namespace: nnf-system\ndata:\n[...]\n xfsStorage:\n commandlines:\n pvCreate: $DEVICE\n vgCreate: $VG_NAME $DEVICE_LIST\n lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME\n mkfs: $DEVICE\n options:\n mountRabbit:\n - noatime\n - nodiratime\n[...]\n
"},{"location":"guides/storage-profiles/readme/#gfs2","title":"GFS2","text":"The following shows how to specify command line options for pvcreate, lvcreate, and mkfs for GFS2.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: gfs2-stripe-example\n namespace: nnf-system\ndata:\n[...]\n gfs2Storage:\n commandlines:\n pvCreate: $DEVICE\n vgCreate: $VG_NAME $DEVICE_LIST\n lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME\n mkfs: -j2 -p $PROTOCOL -t $CLUSTER_NAME:$LOCK_SPACE $DEVICE\n[...]\n
"},{"location":"guides/storage-profiles/readme/#lustre-zfs","title":"Lustre / ZFS","text":"The following shows how to specify a zpool virtual device (vdev). In this case the default vdev is a stripe. See zpoolconcepts(7) for virtual device descriptions.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: zpool-stripe-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n mgtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mgs $VOL_NAME\n mdtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mdt --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n mgtMdtCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --mgs --mdt --fsname=$FS_NAME --index=$INDEX $VOL_NAME\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#zfs-dataset-properties","title":"ZFS dataset properties","text":"The following shows how to specify ZFS dataset properties in the --mkfsoptions
arg for mkfs.lustre. See zfsprops(7).
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: zpool-stripe-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#mount-options-for-targets","title":"Mount Options for Targets","text":""},{"location":"guides/storage-profiles/readme/#persistent-mount-options","title":"Persistent Mount Options","text":"Use the mkfs.lustre --mountfsoptions
parameter to set persistent mount options for Lustre targets.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: target-mount-option-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mountfsoptions=\"errors=remount-ro,mballoc\" --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n[...]\n
"},{"location":"guides/storage-profiles/readme/#non-persistent-mount-options","title":"Non-Persistent Mount Options","text":"Non-persistent mount options can be specified with the ostOptions.mountTarget parameter to the NnfStorageProfile:
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: target-mount-option-example\n namespace: nnf-system\ndata:\n[...]\n lustreStorage:\n[...]\n ostCommandlines:\n zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST\n mkfs: --ost --mountfsoptions=\"errors=remount-ro\" --mkfsoptions=\"recordsize=1024K -o compression=lz4\" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME\n ostOptions:\n mountTarget:\n - mballoc\n[...]\n
"},{"location":"guides/storage-profiles/readme/#target-layout","title":"Target Layout","text":"Users may want Lustre file systems with different performance characteristics. For example, a user job with a single compute node accessing the Lustre file system would see acceptable performance from a single OSS. An FPP workload might want as many OSSs as posible to avoid contention.
The NnfStorageProfile
allows admins to specify where and how many Lustre targets are allocated by the WLM. During the proposal phase of the workflow, the NNF software uses the information in the NnfStorageProfile
to add extra constraints in the DirectiveBreakdown
. The WLM uses these constraints when picking storage.
The NnfStorageProfile
has three fields in the mgtOptions
, mdtOptions
, and ostOptions
to specify target layout. The fields are:
count
- A static value for how many Lustre targets to create. scale
- A value from 1-10 that the WLM can use to determine how many Lustre targets to allocate. This is up to the WLM and the admins to agree on how to interpret this field. A value of 1 might indicate the minimum number of NNF nodes needed to reach the minimum capacity, while 10 might result in a Lustre target on every Rabbit attached to the computes in the job. Scale takes into account allocation size, compute node count, and Rabbit count. colocateComputes
- true/false value. When \"true\", this adds a location constraint in the DirectiveBreakdown
that limits the WLM to picking storage with a physical connection to the compute resources. In practice this means that Rabbit storage is restricted to the chassis used by the job. This can be set individually for each of the Lustre target types. When this is \"false\", any Rabbit storage can be picked, even if the Rabbit doesn't share a chassis with any of the compute nodes in the job.
Only one of scale
and count
can be set for a particular target type.
The DirectiveBreakdown
for create_persistent
#DWs won't include the constraint from colocateCompute=true
since there may not be any compute nodes associated with the job.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfStorageProfile\nmetadata:\n name: high-metadata\n namespace: default\ndata:\n default: false\n...\n lustreStorage:\n combinedMgtMdt: false\n capacityMdt: 500GiB\n capacityMgt: 1GiB\n[...]\n ostOptions:\n scale: 5\n colocateComputes: true\n mdtOptions:\n count: 10\n
"},{"location":"guides/storage-profiles/readme/#example-layouts","title":"Example Layouts","text":"scale
with colocateComputes=true
will likely be the most common layout type to use for jobdw
directives. This will result in a Lustre file system whose performance scales with the number of compute nodes in the job.
count
may be used when a specific performance characteristic is desired such as a single shared file workload that has low metadata requirements and only needs a single MDT. It may also be useful when a consistently performing file system is required across different jobs.
colocatedComputes=false
may be useful for placing MDTs on NNF nodes without an OST (within the same file system).
The count
field may be useful when creating a persistent file system since the job with the create_persistent
directive may only have a single compute node.
In general, scale
gives a simple way for users to get a filesystem that has performance consistent with their job size. count
is useful for times when a user wants full control of the file system layout.
"},{"location":"guides/storage-profiles/readme/#command-line-variables","title":"Command Line Variables","text":""},{"location":"guides/storage-profiles/readme/#pvcreate","title":"pvcreate","text":" $DEVICE
- expands to the /dev/<path>
value for one device that has been allocated
"},{"location":"guides/storage-profiles/readme/#vgcreate","title":"vgcreate","text":" $VG_NAME
- expands to a volume group name that is controlled by Rabbit software. $DEVICE_LIST
- expands to a list of space-separated /dev/<path>
devices. This list will contain the devices that were iterated over for the pvcreate step.
"},{"location":"guides/storage-profiles/readme/#lvcreate","title":"lvcreate","text":" $VG_NAME
- see vgcreate above. $LV_NAME
- expands to a logical volume name that is controlled by Rabbit software. $DEVICE_NUM
- expands to a number indicating the number of devices allocated for the volume group. $DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the $DEVICE_LIST
above.
"},{"location":"guides/storage-profiles/readme/#xfs-mkfs","title":"XFS mkfs","text":" $DEVICE
- expands to the /dev/<path>
value for the logical volume that was created by the lvcreate step above.
"},{"location":"guides/storage-profiles/readme/#gfs2-mkfs","title":"GFS2 mkfs","text":" $DEVICE
- expands to the /dev/<path>
value for the logical volume that was created by the lvcreate step above. $CLUSTER_NAME
- expands to a cluster name that is controlled by Rabbit Software $LOCK_SPACE
- expands to a lock space key that is controlled by Rabbit Software. $PROTOCOL
- expands to a locking protocol that is controlled by Rabbit Software.
"},{"location":"guides/storage-profiles/readme/#zpool-create","title":"zpool create","text":" $DEVICE_LIST
- expands to a list of space-separated /dev/<path>
devices. This list will contain the devices that were allocated for this storage request. $POOL_NAME
- expands to a pool name that is controlled by Rabbit software. $DEVICE_NUM
- expands to a number indicating the number of devices allocated for this storage request. $DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the $DEVICE_LIST
above.
"},{"location":"guides/storage-profiles/readme/#lustre-mkfs","title":"lustre mkfs","text":" $FS_NAME
- expands to the filesystem name that was passed to Rabbit software from the workflow's #DW line. $MGS_NID
- expands to the NID of the MGS. If the MGS was orchestrated by nnf-sos then an appropriate internal value will be used. $POOL_NAME
- see zpool create above. $VOL_NAME
- expands to the volume name that will be created. This value will be <pool_name>/<dataset>
, and is controlled by Rabbit software. $INDEX
- expands to the index value of the target and is controlled by Rabbit software.
"},{"location":"guides/user-containers/readme/","title":"NNF User Containers","text":"NNF User Containers are a mechanism to allow user-defined containerized applications to be run on Rabbit nodes with access to NNF ephemeral and persistent storage.
"},{"location":"guides/user-containers/readme/#overview","title":"Overview","text":"Container workflows are orchestrated through the use of two components: Container Profiles and Container Directives. A Container Profile defines the container to be executed. Most importantly, it allows you to specify which NNF storages are accessible within the container and which container image to run. The containers are executed on the NNF nodes that are allocated to your container workflow. These containers can be executed in either of two modes: Non-MPI and MPI.
For Non-MPI applications, the image and command are launched across all the targeted NNF Nodes in a uniform manner. This is useful in simple applications, where non-distributed behavior is desired.
For MPI applications, a single launcher container serves as the point of contact, responsible for distributing tasks to various worker containers. Each of the NNF nodes targeted by the workflow receives its corresponding worker container. The focus of this documentation will be on MPI applications.
To see a full working example before diving into these docs, see Putting It All Together.
"},{"location":"guides/user-containers/readme/#before-creating-a-container-workflow","title":"Before Creating a Container Workflow","text":"Before creating a workflow, a working NnfContainerProfile
must exist. This profile is referenced in the container directive supplied with the workflow.
"},{"location":"guides/user-containers/readme/#container-profiles","title":"Container Profiles","text":"The author of a containerized application will work with the administrator to define a pod specification template for the container and to create an appropriate NnfContainerProfile
resource for the container. The image and tag for the user's container will be specified in the profile.
The image must be available in a registry that is available to your system. This could be docker.io, ghcr.io, etc., or a private registry. Note that for a private registry, some additional setup is required. See here for more info.
The image itself has a few requirements. See here for more info on building images.
New NnfContainerProfile
resources may be created by copying one of the provided example profiles from the nnf-system
namespace . The examples may be found by listing them with kubectl
:
kubectl get nnfcontainerprofiles -n nnf-system\n
The next few subsections provide an overview of the primary components comprising an NnfContainerProfile
. However, it's important to note that while these sections cover the key aspects, they don't encompass every single detail. For an in-depth understanding of the capabilities offered by container profiles, we recommend referring to the following resources:
- Type definition for
NnfContainerProfile
- Sample for
NnfContainerProfile
- Online Examples for
NnfContainerProfile
(same as kubectl get
above)
"},{"location":"guides/user-containers/readme/#container-storages","title":"Container Storages","text":"The Storages
defined in the profile allow NNF filesystems to be made available inside of the container. These storages need to be referenced in the container workflow unless they are marked as optional.
There are three types of storages available to containers:
- local non-persistent storage (created via
#DW jobdw
directives) - persistent storage (created via
#DW create_persistent
directives) - global lustre storage (defined by
LustreFilesystems
)
For local and persistent storage, only GFS2 and Lustre filesystems are supported. Raw and XFS filesystems cannot be mounted more than once, so they cannot be mounted inside of a container while also being mounted on the NNF node itself.
For each storage in the profile, the name must follow these patterns (depending on the storage type):
DW_JOB_<storage_name>
DW_PERSISTENT_<storage_name>
DW_GLOBAL_<storage_name>
<storage_name>
is provided by the user and needs to be a name compatible with Linux environment variables (so underscores must be used, not dashes), since the storage mount directories are provided to the container via environment variables.
This storage name is used in container workflow directives to reference the NNF storage name that defines the filesystem. Find more info on that in Creating a Container Workflow.
Storages may be deemed as optional
in a profile. If a storage is not optional, the storage name must be set to the name of an NNF filesystem name in the container workflow.
For global lustre, there is an additional field for pvcMode
, which must match the mode that is configured in the LustreFilesystem
resource that represents the global lustre filesystem. This defaults to ReadWriteMany
.
Example:
storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n - name: DW_GLOBAL_foo_global_lustre\n optional: true\n pvcMode: ReadWriteMany\n
"},{"location":"guides/user-containers/readme/#container-spec","title":"Container Spec","text":"As mentioned earlier, container workflows can be categorized into two types: MPI and Non-MPI. It's essential to choose and define only one of these types within the container profile. Regardless of the type chosen, the data structure that implements the specification is equipped with two \"standard\" resources that are distinct from NNF custom resources.
For Non-MPI containers, the specification utilizes the spec
resource. This is the standard Kubernetes PodSpec
that outlines the desired configuration for the pod.
For MPI containers, mpiSpec
is used. This custom resource, available through MPIJobSpec
from mpi-operator
, serves as a facilitator for executing MPI applications across worker containers. This resource can be likened to a wrapper around a PodSpec
, but users need to define a PodSpec
for both Launcher and Worker containers.
See the MPIJobSpec
definition for more details on what can be configured for an MPI application.
It's important to bear in mind that the NNF Software is designed to override specific values within the MPIJobSpec
for ensuring the desired behavior in line with NNF software requirements. To prevent complications, it's advisable not to delve too deeply into the specification. A few illustrative examples of fields that are overridden by the NNF Software include:
- Replicas
- RunPolicy.BackoffLimit
- Worker/Launcher.RestartPolicy
- SSHAuthMountPath
By keeping these considerations in mind and refraining from extensive alterations to the specification, you can ensure a smoother integration with the NNF Software and mitigate any potential issues that may arise.
Please see the Sample and Examples listed above for more detail on container Specs.
"},{"location":"guides/user-containers/readme/#container-ports","title":"Container Ports","text":"Container Profiles allow for ports to be reserved for a container workflow. numPorts
can be used to specify the number of ports needed for a container workflow. The ports are opened on each targeted NNF node and are accessible outside of the cluster. Users must know how to contact the specific NNF node. It is recommend that DNS entries are made for this purpose.
In the workflow, the allocated port numbers are made available via the NNF_CONTAINER_PORTS
environment variable.
The workflow requests this number of ports from the NnfPortManager
, which is responsible for managing the ports allocated to container workflows. This resource can be inspected to see which ports are allocated.
Once a port is assigned to a workflow, that port number becomes unavailable for use by any other workflow until it is released.
Note
The SystemConfiguration
must be configured to allow for a range of ports, otherwise container workflows will fail in the Setup
state due to insufficient resources. See SystemConfiguration Setup.
"},{"location":"guides/user-containers/readme/#systemconfiguration-setup","title":"SystemConfiguration Setup","text":"In order for container workflows to request ports from the NnfPortManager
, the SystemConfiguration
must be configured for a range of ports:
kind: SystemConfiguration\nmetadata:\n name: default\n namespace: default\nspec:\n # Ports is the list of ports available for communication between nodes in the\n # system. Valid values are single integers, or a range of values of the form\n # \"START-END\" where START is an integer value that represents the start of a\n # port range and END is an integer value that represents the end of the port\n # range (inclusive).\n ports:\n - 4000-4999\n # PortsCooldownInSeconds is the number of seconds to wait before a port can be\n # reused. Defaults to 60 seconds (to match the typical value for the kernel's\n # TIME_WAIT). A value of 0 means the ports can be reused immediately.\n # Defaults to 60s if not set.\n portsCooldownInSeconds: 60\n
ports
is empty by default, and must be set by an administrator.
Multiple port ranges can be specified in this list, as well as single integers. This must be a safe port range that does not interfere with the ephemeral port range of the Linux kernel. The range should also account for the estimated number of simultaneous users that are running container workflows.
Once a container workflow is done, the port is released and the NnfPortManager
will not allow reuse of the port until the amount of time specified by portsCooldownInSeconds
has elapsed. Then the port can be reused by another container workflow.
"},{"location":"guides/user-containers/readme/#restricting-to-user-id-or-group-id","title":"Restricting To User ID or Group ID","text":"New NnfContainerProfile resources may be restricted to a specific user ID or group ID . When a data.userID
or data.groupID
is specified in the profile, only those Workflow resources having a matching user ID or group ID will be allowed to use that profile . If the profile specifies both of these IDs, then the Workflow resource must match both of them.
"},{"location":"guides/user-containers/readme/#creating-a-container-workflow","title":"Creating a Container Workflow","text":"The user's workflow will specify the name of the NnfContainerProfile
in a DW directive. If the custom profile is named red-rock-slushy
then it will be specified in the #DW container
directive with the profile
parameter.
#DW container profile=red-rock-slushy [...]\n
Furthermore, to set the container storages for the workflow, storage parameters must also be supplied in the workflow. This is done using the <storage_name>
(see Container Storages) and setting it to the name of a storage directive that defines an NNF filesystem. That storage directive must already exist as part of another workflow (e.g. persistent storage) or it can be supplied in the same workflow as the container. For global lustre, the LustreFilesystem
must exist that represents the global lustre filesystem.
In this example, we're creating a GFS2 filesystem to accompany the container directive. We're using the red-rock-slushy
profile which contains a non-optional storage called DW_JOB_local_storage
:
kind: NnfContainerProfile\nmetadata:\n name: red-rock-slushy\ndata:\n storages:\n - name: DW_JOB_local_storage\n optional: false\n template:\n mpiSpec:\n ...\n
The resulting container directive looks like this:
#DW jobdw name=my-gfs2 type=gfs2 capacity=100GB\"\n#DW container name=my-container profile=red-rock-slushy DW_JOB_local_storage=my-gfs2\n
Once the workflow progresses, this will create a 100GB GFS2 filesystem that is then mounted into the container upon creation. An environment variable called DW_JOB_local_storage
is made available inside of the container and provides the path to the mounted NNF GFS2 filesystem. An application running inside of the container can then use this variable to get to the filesystem mount directory. See here.
Multiple storages can be defined in the container directives. Only one container directive is allowed per workflow.
Note
GFS2 filesystems have special considerations since the mount directory contains directories for every compute node. See GFS2 Index Mounts for more info.
"},{"location":"guides/user-containers/readme/#targeting-nodes","title":"Targeting Nodes","text":"For container directives, compute nodes must be assigned to the workflow. The NNF software will trace the compute nodes back to their local NNF nodes and the containers will be executed on those NNF nodes. The act of assigning compute nodes to your container workflow instructs the NNF software to select the NNF nodes that run the containers.
For the jobdw
directive that is included above, the servers (i.e. NNF nodes) must also be assigned along with the computes.
"},{"location":"guides/user-containers/readme/#running-a-container-workflow","title":"Running a Container Workflow","text":"Once the workflow is created, the WLM progresses it through the following states. This is a quick overview of the container-related behavior that occurs:
- Proposal: Verify storages are provided according to the container profile.
- Setup: If applicable, request ports from NnfPortManager.
- DataIn: No container related activity.
- PreRun: Appropriate
MPIJob
or Job(s)
are created for the workflow. In turn, user containers are created and launched by Kubernetes. Containers are expected to start in this state. - PostRun: Once in PostRun, user containers are expected to complete (non-zero exit) successfully.
- DataOut: No container related activity.
- Teardown: Ports are released;
MPIJob
or Job(s)
are deleted, which in turn deletes the user containers.
The two main states of a container workflow (i.e. PreRun, PostRun) are discussed further in the following sections.
"},{"location":"guides/user-containers/readme/#prerun","title":"PreRun","text":"In PreRun, the containers are created and expected to start. Once the containers reach a non-initialization state (i.e. Running), the containers are considered to be started and the workflow can advance.
By default, containers are expected to start within 60 seconds. If not, the workflow reports an Error that the containers cannot be started. This value is configurable via the preRunTimeoutSeconds
field in the container profile.
To summarize the PreRun behavior:
- If the container starts successfully (running), transition to
Completed
status. - If the container fails to start, transition to the
Error
status. - If the container is initializing and has not started after
preRunTimeoutSeconds
seconds, terminate the container and transition to the Error
status.
"},{"location":"guides/user-containers/readme/#init-containers","title":"Init Containers","text":"The NNF Software injects Init Containers into the container specification to perform initialization tasks. These containers must run to completion before the main container can start.
These initialization tasks include:
- Ensuring the proper permissions (i.e. UID/GID) are available in the main container
- For MPI jobs, ensuring the launcher pod can contact each worker pod via DNS
"},{"location":"guides/user-containers/readme/#prerun-completed","title":"PreRun Completed","text":"Once PreRun has transitioned to Completed
status, the user container is now running and the WLM should initiate applications on the compute nodes. Utilizing container ports, the applications on the compute nodes can establish communication with the user containers, which are running on the local NNF node attached to the computes.
This communication allows for the compute node applications to drive certain behavior inside of the user container. For example, once the compute node application is complete, it can signal to the user container that it is time to perform cleanup or data migration action.
"},{"location":"guides/user-containers/readme/#postrun","title":"PostRun","text":"In PostRun, the containers are expected to exit cleanly with a zero exit code. If a container fails to exit cleanly, the Kubernetes software attempts a number of retries based on the configuration of the container profile. It continues to do this until the container exits successfully, or until the retryLimit
is hit - whichever occurs first. In the latter case, the workflow reports an Error.
Read up on the Failure Retries for more information on retries.
Furthermore, the container profile features a postRunTimeoutSeconds
field. If this timeout is reached before the container successfully exits, it triggers an Error
status. The timer for this timeout begins upon entry into the PostRun phase, allowing the containers the specified period to execute before the workflow enters an Error
status.
To recap the PostRun behavior:
- If the container exits successfully, transition to
Completed
status. - If the container exits unsuccessfully after
retryLimit
number of retries, transition to the Error
status. - If the container is running and has not exited after
postRunTimeoutSeconds
seconds, terminate the container and transition to the Error
status.
"},{"location":"guides/user-containers/readme/#failure-retries","title":"Failure Retries","text":"If a container fails (non-zero exit code), the Kubernetes software implements retries. The number of retries can be set via the retryLimit
field in the container profile. If a non-zero exit code is detected, the Kubernetes software creates a new instance of the pod and retries. The default number of retries for retryLimit
is set to 6, which is the default value for Kubernetes Jobs. This means that if the pods fails every single time, there will be 7 failed pods in total since it attempted 6 retries after the first failure.
To understand this behavior more, see Pod backoff failure policy in the Kubernetes documentation. This explains the retry (i.e. backoff) behavior in more detail.
It is important to note that due to the configuration of the MPIJob
and/or Job
that is created for User Containers, the container retries are immediate - there is no backoff timeout between retires. This is due to the NNF Software setting the RestartPolicy
to Never
, which causes a new pod to spin up after every failure rather than re-use (i.e. restart) the previously failed pod. This allows a user to see a complete history of the failed pod(s) and the logs can easily be obtained. See more on this at Handling Pod and container failures in the Kubernetes documentation.
"},{"location":"guides/user-containers/readme/#putting-it-all-together","title":"Putting it All Together","text":"See the NNF Container Example for a working example of how to run a simple MPI application inside of an NNF User Container and run it through a Container Workflow.
"},{"location":"guides/user-containers/readme/#reference","title":"Reference","text":""},{"location":"guides/user-containers/readme/#environment-variables","title":"Environment Variables","text":"Two sets of environment variables are available with container workflows: Container and Compute Node. The former are the variables that are available inside the user containers. The latter are the variables that are provided back to the DWS workflow, which in turn are collected by the WLM and provided to compute nodes. See the WLM documentation for more details.
"},{"location":"guides/user-containers/readme/#container-environment-variables","title":"Container Environment Variables","text":"These variables are provided for use inside the container. They can be used as part of the container command in the NNF Container Profile or within the container itself.
"},{"location":"guides/user-containers/readme/#storages","title":"Storages","text":"Each storage defined by a container profile and used in a container workflow results in a corresponding environment variable. This variable is used to hold the mount directory of the filesystem.
"},{"location":"guides/user-containers/readme/#gfs2-index-mounts","title":"GFS2 Index Mounts","text":"When using a GFS2 file system, each compute is allocated its own NNF volume. The NNF software mounts a collection of directories that are indexed (e.g. 0/
, 1/
, etc) to the compute nodes.
Application authors must be aware that their desired GFS2 mount-point really a collection of directories, one for each compute node. It is the responsibility of the author to understand the underlying filesystem mounted at the storage environment variable (e.g. $DW_JOB_my_gfs2_storage
).
Each compute node's application can leave breadcrumbs (e.g. hostnames) somewhere on the GFS2 filesystem mounted on the compute node. This can be used to identify the index mount directory to a compute node from the application running inside of the user container.
Here is an example of 3 compute nodes on an NNF node targeted in a GFS2 workflow:
$ ls $DW_JOB_my_gfs2_storage/*\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/0\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/1\n/mnt/nnf/3e92c060-ca0e-4ddb-905b-3d24137cbff4-0/2\n
Node positions are not absolute locations. The WLM could, in theory, select 6 physical compute nodes at physical location 1, 2, 3, 5, 8, 13, which would appear as directories /0
through /5
in the container mount path.
Additionally, not all container instances could see the same number of compute nodes in an indexed-mount scenario. If 17 compute nodes are required for the job, WLM may assign 16 nodes to run one NNF node, and 1 node to another NNF. The first NNF node would have 16 index directories, whereas the 2nd would only contain 1.
"},{"location":"guides/user-containers/readme/#hostnames-and-domains","title":"Hostnames and Domains","text":"Containers can contact one another via Kubernetes cluster networking. This functionality is provided by DNS. Environment variables are provided that allow a user to be able to piece together the FQDN so that the other containers can be contacted.
This example demonstrates an MPI container workflow, with two worker pods. Two worker pods means two pods/containers running on two NNF nodes.
"},{"location":"guides/user-containers/readme/#ports","title":"Ports","text":"See the NNF_CONTAINER_PORTS
section under Compute Node Environment Variables.
mpiuser@my-container-workflow-launcher:~$ env | grep NNF\nNNF_CONTAINER_HOSTNAMES=my-container-workflow-launcher my-container-workflow-worker-0 my-container-workflow-worker-1\nNNF_CONTAINER_DOMAIN=default.svc.cluster.local\nNNF_CONTAINER_SUBDOMAIN=my-container-workflow-worker\n
The container FQDN consists of the following: <HOSTNAME>.<SUBDOMAIN>.<DOMAIN>
. To contact the other worker container from worker 0, my-container-workflow-worker-1.my-container-workflow-worker.default.svc.cluster.local
would be used.
For MPI-based containers, an alternate way to retrieve this information is to look at the default hostfile
, provided by mpi-operator
. This file lists out all the worker nodes' FQDNs:
mpiuser@my-container-workflow-launcher:~$ cat /etc/mpi/hostfile\nmy-container-workflow-worker-0.my-container-workflow-worker.default.svc slots=1\nmy-container-workflow-worker-1.my-container-workflow-worker.default.svc slots=1\n
"},{"location":"guides/user-containers/readme/#compute-node-environment-variables","title":"Compute Node Environment Variables","text":"These environment variables are provided to the compute node via the WLM by way of the DWS Workflow. Note that these environment variables are consistent across all the compute nodes for a given workflow.
Note
It's important to note that the variables presented here pertain exclusively to User Container-related variables. This list does not encompass the entirety of NNF environment variables accessible to the compute node through the Workload Manager (WLM)
"},{"location":"guides/user-containers/readme/#nnf_container_ports","title":"NNF_CONTAINER_PORTS
","text":"If the NNF Container Profile requests container ports, then this environment variable provides the allocated ports for the container. This is a comma separated list of ports if multiple ports are requested.
This allows an application on the compute node to contact the user container running on its local NNF node via these port numbers. The compute node must have proper routing to the NNF Node and needs a generic way of contacting the NNF node. It is suggested than a DNS entry is provided via /etc/hosts
, or similar.
For cases where one port is requested, the following can be used to contact the user container running on the NNF node (assuming a DNS entry for local-rabbit
is provided via /etc/hosts
).
local-rabbit:$(NNF_CONTAINER_PORTS)\n
"},{"location":"guides/user-containers/readme/#creating-images","title":"Creating Images","text":"For details, refer to the NNF Container Example Readme. However, in broad terms, an image that is capable of supporting MPI necessitates the following components:
- User Application: Your specific application
- Open MPI: Incorporate Open MPI to facilitate MPI operations
- SSH Server: Including an SSH server to enable communication
- nslookup: To validate Launcher/Worker container communication over the network
By ensuring the presence of these components, users can create an image that supports MPI operations on the NNF platform.
The nnf-mfu image serves as a suitable base image, encompassing all the essential components required for this purpose.
"},{"location":"guides/user-containers/readme/#using-a-private-container-repository","title":"Using a Private Container Repository","text":"The user's containerized application may be placed in a private repository . In this case, the user must define an access token to be used with that repository, and that token must be made available to the Rabbit's Kubernetes environment so that it can pull that container from the private repository.
See Pull an Image from a Private Registry in the Kubernetes documentation for more information.
"},{"location":"guides/user-containers/readme/#about-the-example","title":"About the Example","text":"Each container registry will have its own way of letting its users create tokens to be used with their repositories . Docker Hub will be used for the private repository in this example, and the user's account on Docker Hub will be \"dean\".
"},{"location":"guides/user-containers/readme/#preparing-the-private-repository","title":"Preparing the Private Repository","text":"The user's application container is named \"red-rock-slushy\" . To store this container on Docker Hub the user must log into docker.com with their browser and click the \"Create repository\" button to create a repository named \"red-rock-slushy\", and the user must check the box that marks the repository as private . The repository's name will be displayed as \"dean/red-rock-slushy\" with a lock icon to show that it is private.
"},{"location":"guides/user-containers/readme/#create-and-push-a-container","title":"Create and Push a Container","text":"The user will create their container image in the usual ways, naming it for their private repository and tagging it according to its release.
Prior to pushing images to the repository, the user must complete a one-time login to the Docker registry using the docker command-line tool.
docker login -u dean\n
After completing the login, the user may then push their images to the repository.
docker push dean/red-rock-slushy:v1.0\n
"},{"location":"guides/user-containers/readme/#generate-a-read-only-token","title":"Generate a Read-Only Token","text":"A read-only token must be generated to allow Kubernetes to pull that container image from the private repository, because Kubernetes will not be running as that user . This token must be given to the administrator, who will use it to create a Kubernetes secret.
To log in and generate a read-only token to share with the administrator, the user must follow these steps:
- Visit docker.com and log in using their browser.
- Click on the username in the upper right corner.
- Select \"Account Settings\" and navigate to \"Security\".
- Click the \"New Access Token\" button to create a read-only token.
- Keep a copy of the generated token to share with the administrator.
"},{"location":"guides/user-containers/readme/#store-the-read-only-token-as-a-kubernetes-secret","title":"Store the Read-Only Token as a Kubernetes Secret","text":"The administrator must store the user's read-only token as a kubernetes secret . The secret must be placed in the default
namespace, which is the same namespace where the user containers will be run . The secret must include the user's Docker Hub username and the email address they have associated with that username . In this case, the secret will be named readonly-red-rock-slushy
.
USER_TOKEN=users-token-text\nUSER_NAME=dean\nUSER_EMAIL=dean@myco.com\nSECRET_NAME=readonly-red-rock-slushy\nkubectl create secret docker-registry $SECRET_NAME -n default --docker-server=\"https://index.docker.io/v1/\" --docker-username=$USER_NAME --docker-password=$USER_TOKEN --docker-email=$USER_EMAIL\n
"},{"location":"guides/user-containers/readme/#add-the-secret-to-the-nnfcontainerprofile","title":"Add the Secret to the NnfContainerProfile","text":"The administrator must add an imagePullSecrets
list to the NnfContainerProfile resource that was created for this user's containerized application.
The following profile shows the placement of the readonly-red-rock-slushy
secret which was created in the previous step, and points to the user's dean/red-rock-slushy:v1.0
container.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfContainerProfile\nmetadata:\n name: red-rock-slushy\n namespace: nnf-system\ndata:\n pinned: false\n retryLimit: 6\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - command:\n - /users-application\n image: dean/red-rock-slushy:v1.0\n name: red-rock-app\n storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n
Now any user can select this profile in their Workflow by specifying it in a #DW container
directive.
#DW container profile=red-rock-slushy [...]\n
"},{"location":"guides/user-containers/readme/#using-a-private-container-repository-for-mpi-application-containers","title":"Using a Private Container Repository for MPI Application Containers","text":"If our user's containerized application instead contains an MPI application, because perhaps it's a private copy of nnf-mfu, then the administrator would insert two imagePullSecrets
lists into the mpiSpec
of the NnfContainerProfile for the MPI launcher and the MPI worker.
apiVersion: nnf.cray.hpe.com/v1alpha1\nkind: NnfContainerProfile\nmetadata:\n name: mpi-red-rock-slushy\n namespace: nnf-system\ndata:\n mpiSpec:\n mpiImplementation: OpenMPI\n mpiReplicaSpecs:\n Launcher:\n template:\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - command:\n - mpirun\n - dcmp\n - $(DW_JOB_foo_local_storage)/0\n - $(DW_JOB_foo_local_storage)/1\n image: dean/red-rock-slushy:v2.0\n name: red-rock-launcher\n Worker:\n template:\n spec:\n imagePullSecrets:\n - name: readonly-red-rock-slushy\n containers:\n - image: dean/red-rock-slushy:v2.0\n name: red-rock-worker\n runPolicy:\n cleanPodPolicy: Running\n suspend: false\n slotsPerWorker: 1\n sshAuthMountPath: /root/.ssh\n pinned: false\n retryLimit: 6\n storages:\n - name: DW_JOB_foo_local_storage\n optional: false\n - name: DW_PERSISTENT_foo_persistent_storage\n optional: true\n
Now any user can select this profile in their Workflow by specifying it in a #DW container
directive.
#DW container profile=mpi-red-rock-slushy [...]\n
"},{"location":"guides/user-interactions/readme/","title":"Rabbit User Interactions","text":""},{"location":"guides/user-interactions/readme/#overview","title":"Overview","text":"A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form #DW [command] [command args]
, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes.
Once the job is running on compute nodes, the application can find access to Rabbit specific resources through a set of environment variables that provide mount and network access information.
"},{"location":"guides/user-interactions/readme/#commands","title":"Commands","text":""},{"location":"guides/user-interactions/readme/#jobdw","title":"jobdw","text":"The jobdw
directive command tells the Rabbit software to create a file system on the Rabbit hardware for the lifetime of the user's job. At the end of the job, any data that is not moved off of the file system either by the application or through a copy_out
directive will be lost. Multiple jobdw
directives can be listed in the same job script.
"},{"location":"guides/user-interactions/readme/#command-arguments","title":"Command Arguments","text":"Argument Required Value Notes type
Yes raw
, xfs
, gfs2
, lustre
Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. capacity
Yes Allocation size with units. 1TiB
, 100GB
, etc. Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: KB
, KiB
, MB
, MiB
, GB
, GiB
, TB
, TiB
name
Yes String including numbers and '-' This is a name for the storage allocation that is unique within a job profile
No Profile name This specifies which profile to use when allocating storage. Profiles include mkfs
and mount
arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. More information about storage profiles can be found in the Storage Profiles guide. requires
No copy-offload
Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the Required Daemons section of the Directive Breakdown guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand."},{"location":"guides/user-interactions/readme/#examples","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=scratch\n
This directive results in a 10GiB xfs file system created for each compute node in the job using the default storage profile.
#DW jobdw type=lustre capacity=1TB name=dw-temp profile=high-metadata\n
This directive results in a single 1TB Lustre file system being created that can be accessed from all the compute nodes in the job. It is using a storage profile that an admin created to give high Lustre metadata performance.
#DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload\n
This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running using the Copy Offload API.
"},{"location":"guides/user-interactions/readme/#create_persistent","title":"create_persistent","text":"The create_persistent
command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single create_persistent
directive is allowed in a job, and it cannot be in the same job as a destroy_persistent
directive. See persistentdw to utilize the storage in a job.
"},{"location":"guides/user-interactions/readme/#command-arguments_1","title":"Command Arguments","text":"Argument Required Value Notes type
Yes raw
, xfs
, gfs2
, lustre
Type defines how the storage should be formatted. For Lustre file systems, a single file system is created. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node in the job. capacity
Yes Allocation size with units. 1TiB
, 100GB
, etc. Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: KB
, KiB
, MB
, MiB
, GB
, GiB
, TB
, TiB
name
Yes Lowercase string including numbers and '-' This is a name for the storage allocation that is unique within the system profile
No Profile name This specifies which profile to use when allocating storage. Profiles include mkfs
and mount
arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. The profile used when creating the persistent storage allocation is the same profile used by jobs that use the persistent storage. More information about storage profiles can be found in the Storage Profiles guide."},{"location":"guides/user-interactions/readme/#examples_1","title":"Examples","text":"#DW create_persistent type=xfs capacity=100GiB name=scratch\n
This directive results in a 100GiB xfs file system created for each compute node in the job using the default storage profile. Since xfs file systems are not network accessible, subsequent jobs that want to use the file system must have the same number of compute nodes, and be scheduled on compute nodes with access to the correct Rabbit nodes. This means the job with the create_persistent
directive must schedule the desired number of compute nodes even if no application is run on the compute nodes as part of the job.
#DW create_persistent type=lustre capacity=10TiB name=shared-data profile=read-only\n
This directive results in a single 10TiB Lustre file system being created that can be accessed later by any compute nodes in the system. Multiple jobs can access a Rabbit Lustre file system at the same time. This job can be scheduled with a single compute node (or zero compute nodes if the WLM allows), without any limitations on compute node counts for subsequent jobs using the persistent Lustre file system.
"},{"location":"guides/user-interactions/readme/#destroy_persistent","title":"destroy_persistent","text":"The destroy_persistent
command will delete persistent storage that was allocated by a corresponding create_persistent
. If the persistent storage is currently in use by a job, then the job containing the destroy_persistent
command will fail. Only a single destroy_persistent
directive is allowed in a job, and it cannot be in the same job as a create_persistent
directive.
"},{"location":"guides/user-interactions/readme/#command-arguments_2","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the persistent storage allocation that will be destroyed"},{"location":"guides/user-interactions/readme/#examples_2","title":"Examples","text":"#DW destroy_persistent name=shared-data\n
This directive will delete the persistent storage allocation with the name shared-data
"},{"location":"guides/user-interactions/readme/#persistentdw","title":"persistentdw","text":"The persistentdw
command makes an existing persistent storage allocation available to a job. The persistent storage must already be created from a create_persistent
command in a different job script. Multiple persistentdw
commands can be used in the same job script to request access to multiple persistent allocations.
Persistent Lustre file systems can be accessed from any compute nodes in the system, and the compute node count for the job can vary as needed. Multiple jobs can access a persistent Lustre file system concurrently if desired. Raw, xfs, and GFS2 file systems can only be accessed by compute nodes that have a physical connection to the Rabbits hosting the storage, and jobs accessing these storage types must have the same compute node count as the job that made the persistent storage.
"},{"location":"guides/user-interactions/readme/#command-arguments_3","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the persistent storage that will be accessed requires
No copy-offload
Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application. See the Required Daemons section of the Directive Breakdown guide for a description of how the user may request the daemon, in the case where the WLM will run it only on demand."},{"location":"guides/user-interactions/readme/#examples_3","title":"Examples","text":"#DW persistentdw name=shared-data requires=copy-offload\n
This directive will cause the shared-data
persistent storage allocation to be mounted onto the compute nodes for the job application to use. The copy-offload daemon will be started on the compute nodes so the application can request data movement during the application run.
"},{"location":"guides/user-interactions/readme/#copy_incopy_out","title":"copy_in/copy_out","text":"The copy_in
and copy_out
directives are used to move data to and from the storage allocations on Rabbit nodes. The copy_in
directive requests that data be moved into the Rabbit file system before application launch, and the copy_out
directive requests data to be moved off of the Rabbit file system after application exit. This is different from data-movement that is requested through the copy-offload API, which occurs during application runtime. Multiple copy_in
and copy_out
directives can be included in the same job script. More information about data movement can be found in the Data Movement documentation.
"},{"location":"guides/user-interactions/readme/#command-arguments_4","title":"Command Arguments","text":"Argument Required Value Notes source
Yes [path]
, $DW_JOB_[name]/[path]
, $DW_PERSISTENT_[name]/[path]
[name]
is the name of the Rabbit persistent or job storage as specified in the name
argument of the jobdw
or persistentdw
directive. Any '-'
in the name from the jobdw
or persistentdw
directive should be changed to a '_'
in the copy_in
and copy_out
directive. destination
Yes [path]
, $DW_JOB_[name]/[path]
, $DW_PERSISTENT_[name]/[path]
[name]
is the name of the Rabbit persistent or job storage as specified in the name
argument of the jobdw
or persistentdw
directive. Any '-'
in the name from the jobdw
or persistentdw
directive should be changed to a '_'
in the copy_in
and copy_out
directive. profile
No Profile name This specifies which profile to use when copying data. Profiles specify the copy command to use, MPI arguments, and how output gets logged. If no profile is specified then the default profile is used. Profiles are created by an admin."},{"location":"guides/user-interactions/readme/#examples_4","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=fast-storage\n#DW copy_in source=/lus/backup/johndoe/important_data destination=$DW_JOB_fast_storage/data\n
This set of directives creates an xfs file system on the Rabbits for each compute node in the job, and then moves data from /lus/backup/johndoe/important_data
to each of the xfs file systems. /lus/backup
must be set up in the Rabbit software as a Global Lustre file system by an admin. The copy takes place before the application is launched on the compute nodes.
#DW persistentdw name=shared-data1\n#DW persistentdw name=shared-data2\n\n#DW copy_out source=$DW_PERSISTENT_shared_data1/a destination=$DW_PERSISTENT_shared_data2/a profile=no-xattr\n#DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr\n
This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the no-xattr
profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a deterministic order.
#DW persistentdw name=shared-data\n#DW jobdw type=lustre capacity=1TiB name=fast-storage profile=high-metadata\n\n#DW copy_in source=/lus/shared/johndoe/shared-libraries destination=$DW_JOB_fast_storage/libraries\n#DW copy_in source=$DW_PERSISTENT_shared_data/ destination=$DW_JOB_fast_storage/data\n\n#DW copy_out source=$DW_JOB_fast_storage/data destination=/lus/backup/johndoe/very_important_data profile=no-xattr\n
This set of directives makes use of a persistent storage allocation and a job storage allocation. There are two copy_in
directives, one that copies data from the global lustre file system to the job allocation, and another that copies data from the persistent allocation to the job allocation. These copies do not occur in a deterministic order. The copy_out
directive occurs after the application has exited, and copies data from the Rabbit job storage to a global lustre file system.
"},{"location":"guides/user-interactions/readme/#container","title":"container","text":"The container
directive is used to launch user containers on the Rabbit nodes. The containers have access to jobdw
, persistentdw
, or global Lustre storage as specified in the container
directive. More documentation for user containers can be found in the User Containers guide. Only a single container
directive is allowed in a job.
"},{"location":"guides/user-interactions/readme/#command-arguments_5","title":"Command Arguments","text":"Argument Required Value Notes name
Yes Lowercase string including numbers and '-' This is a name for the container instance that is unique within a job profile
Yes Profile name This specifies which container profile to use. The container profile contains information about which container to run, which file system types to expect, which network ports are needed, and many other options. An admin is responsible for creating the container profiles. DW_JOB_[expected]
No jobdw
storage allocation name
The container profile will list jobdw
file systems that the container requires. [expected]
is the name as specified in the container profile DW_PERSISTENT_[expected]
No persistentdw
storage allocation name
The container profile will list persistentdw
file systems that the container requires. [expected]
is the name as specified in the container profile DW_GLOBAL_[expected]
No Global lustre path The container profile will list global Lustre file systems that the container requires. [expected]
is the name as specified in the container profile"},{"location":"guides/user-interactions/readme/#examples_5","title":"Examples","text":"#DW jobdw type=xfs capacity=10GiB name=fast-storage\n#DW container name=backup profile=automatic-backup DW_JOB_source=fast-storage DW_GLOBAL_destination=/lus/backup/johndoe\n
These directives create an xfs Rabbit job allocation and specify a container that should run on the Rabbit nodes. The container profile specified two file systems that the container needs, DW_JOB_source
and DW_GLOBAL_destination
. DW_JOB_source
requires a jobdw
file system and DW_GLOBAL_destination
requires a global Lustre file system.
"},{"location":"guides/user-interactions/readme/#environment-variables","title":"Environment Variables","text":"The WLM makes a set of environment variables available to the job application running on the compute nodes that provide Rabbit specific information. These environment variables are used to find the mount location of Rabbit file systems and port numbers for user containers.
Environment Variable Value Notes DW_JOB_[name]
Mount path of a jobdw
file system [name]
is from the name
argument in the jobdw
directive. Any '-'
characters in the name
will be converted to '_'
in the environment variable. There will be one of these environment variables per jobdw
directive in the job. DW_PERSISTENT_[name]
Mount path of a persistentdw
file system [name]
is from the name
argument in the persistentdw
directive. Any '-'
characters in the name
will be converted to '_'
in the environment variable. There will be one of these environment variables per persistentdw
directive in the job. NNF_CONTAINER_PORTS
Comma separated list of ports These ports are used together with the IP address of the local Rabbit to communicate with a user container specified by a container
directive. More information can be found in the User Containers guide."},{"location":"repo-guides/readme/","title":"Repo Guides","text":""},{"location":"repo-guides/readme/#management","title":"Management","text":" - Releasing NNF Software
"},{"location":"repo-guides/release-nnf-sw/readme/","title":"Releasing NNF Software","text":""},{"location":"repo-guides/release-nnf-sw/readme/#nnf-software-overview","title":"NNF Software Overview","text":"The following repositories comprise the NNF Software and each have their own versions. There is a hierarchy, since nnf-deploy
packages the individual components together using submodules.
Each component under nnf-deploy
needs to be released first, then nnf-deploy
can be updated to point to those release versions, then nnf-deploy
itself can be updated and released.
The documentation repo (NearNodeFlash/NearNodeFlash.github.io) is released separately and is not part of nnf-deploy
, but it should match the version number of nnf-deploy
. Release this like the other components.
-
NearNodeFlash/nnf-deploy
- DataWorkflowServices/dws
- HewlettPackard/lustre-csi-driver
- NearNodeFlash/lustre-fs-operator
- NearNodeFlash/nnf-mfu
- NearNodeFlash/nnf-sos
- NearNodeFlash/nnf-dm
- NearNodeFlash/nnf-integration-test
-
NearNodeFlash/NearNodeFlash.github.io
nnf-ec is vendored in as part of nnf-sos
and does not need to be released separately.
"},{"location":"repo-guides/release-nnf-sw/readme/#primer","title":"Primer","text":"This document is based on the process set forth by the DataWorkflowServices Release Process. Please read that as a background for this document before going any further.
"},{"location":"repo-guides/release-nnf-sw/readme/#requirements","title":"Requirements","text":"To create tags and releases, you will need maintainer or admin rights on the repos.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-each-component-in-nnf-deploy","title":"Release Each Component In nnf-deploy
","text":"You'll first need to create releases for each component contained in nnf-deploy
. This section describes that process.
Each release branch needs to be updated with what is on master. To do that, we'll need the latest copy of master, and it will ultimately be merged to the releases/v0
branch via a Pull Request. Once merged, an annotated tag is created and then a release.
Each component has its own version number that needs to be incremented. Make sure you change the version numbers in the commands below to match the new version for the component. The v0.0.3
is just an example.
-
Ensure your branches are up to date:
git checkout master\ngit pull\ngit checkout releases/v0\ngit pull\n
-
Create a branch to merge into the release branch:
git checkout -b release-v0.0.3\n
-
Merge in the updates from the master
branch. There should not be any conflicts, but it's not unheard of. Tread carefully if there are conflicts.
git merge master\n
-
Verify that there are no differences between your branch and the master branch:
git diff master\n
If there are any differences, they must be trivial. Some READMEs may have extra lines at the end.
-
Perform repo-specific updates:
- For
lustre-csi-driver
, lustre-fs-operator
, dws
, nnf-sos
, and nnf-dm
there are additional files that need to track the version number as well, which allow them to be installed with kubectl apply -k
.
Repo Update nnf-mfu
The new version of nnf-mfu
is referenced by the NNFMFU
variable in several places:nnf-sos
1. Makefile
replace NNFMFU
with nnf-mfu's
tag.nnf-dm
1. In Dockerfile
and Makefile
, replace NNFMFU_VERSION
with the new version.2. In config/manager/kustomization.yaml
, replace nnf-mfu
's newTag: <X.Y.Z>.
nnf-deploy
1. In config/repositories.yaml
replace NNFMFU_VERSION
with the new version. lustre-fs-operator
update config/manager/kustomization.yaml
with the correct version.nnf-deploy
1. In config/repositories.yaml
replace the lustre-fs-operator version. dws
update config/manager/kustomization.yaml
with the correct version. nnf-sos
update config/manager/kustomization.yaml
with the correct version. nnf-dm
update config/manager/kustomization.yaml
with the correct version. lustre-csi-driver
update deploy/kubernetes/base/kustomization.yaml
and charts/lustre-csi-driver/values.yaml
with the correct version.nnf-deploy
1. In config/repositories.yaml
replace the lustre-csi-driver version. -
Target the releases/v0
branch with a Pull Request from your branch. When merging the Pull Request, you must use a Merge Commit.
Note
Do not Rebase or Squash! Those actions remove the records that Git uses to determine which commits have been merged, and then when the next release is created Git will treat everything like a conflict. Additionally, this will cause auto-generated release notes to include the previous release.
-
Once merged, update the release branch locally and create an annotated tag. Each repo has a workflow job named create_release
that will create a release automatically when the new tag is pushed.
git checkout releases/v0\ngit pull\ngit tag -a v0.0.3 -m \"Release v0.0.3\"\ngit push origin --tags\n
-
GOTO Step 1 and repeat this process for each remaining component.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-nnf-deploy","title":"Release nnf-deploy
","text":"Once the individual components are released, we need to update the submodules in nnf-deploy's
master
branch before we create the release branch. This ensures that everything is current on master
for nnf-deploy
.
-
Update the submodules for nnf-deploy
on master:
cd nnf-deploy\ngit checkout master\ngit pull\ngit submodule foreach git checkout master\ngit submodule foreach git pull\n
-
Create a branch to capture the submodule changes for the PR to master
git checkout -b update-submodules\n
-
Commit the changes and open a Pull Request against the master
branch.
-
Once merged, follow steps 1-3 from the previous section to create a release branch off of releases/v0
and update it with changes from master
.
-
There will be conflicts for the submodules after step 3. This is expected. Update the submodules to the new tags and then commit the changes. If each tag was committed properly, the following command can do this for you:
git submodule foreach 'git checkout `git describe --match=\"v*\" HEAD`'\n
-
Add each submodule to the commit with git add
.
-
Verify that each submodule is now at the proper tagged version.
git submodule\n
-
Update config/repositories.yaml
with the referenced versions for:
lustre-csi-driver
lustre-fs-operator
nnf-mfu
(Search for NNFMFU_VERSION)
-
Tidy and make nnf-deploy
to avoid embarrassment.
go mod tidy\nmake\n
-
Do another git add
for any changes, particularly go.mod
and/or go.sum
.
-
Verify that git status
is happy with nnf-deploy
and then finalize the merge from master by with a git commit
.
-
Follow steps 6-7 from the previous section to finalize the release of nnf-deploy
.
"},{"location":"repo-guides/release-nnf-sw/readme/#release-nearnodeflashgithubio","title":"Release NearNodeFlash.github.io
","text":"Please review and update the documentation for changes you may have made.
After nnf-deploy has a release tag, you may release the documentation. Use the same steps found above in \"Release Each Component\". Note that the default branch for this repo is \"main\" instead of \"master\".
Give this release a tag that matches the nnf-deploy release, to show that they go together. Create the release by using the \"Create release\" or \"Draft a new release\" button in the GUI, or by using the gh release create
CLI command. Whether using the GUI or the CLI, mark the release as \"latest\" and select the appropriate option to generate release notes.
Wait for the mike
tool in .github/workflow/release.yaml
to finish building the new doc. You can check its status by going to the gh-pages
branch in the repo. When you visit the release at https://nearnodeflash.github.io, you should see the new release in the drop-down menu and the new release should be the default display.
The software is now released!
"},{"location":"repo-guides/release-nnf-sw/readme/#clone-a-release","title":"Clone a release","text":"The follow commands clone release v0.0.7
into nnf-deploy-v0.0.7
export NNF_VERSION=v0.0.7\n\ngit clone --recurse-submodules git@github.com:NearNodeFlash/nnf-deploy nnf-deploy-$NNF_VERSION\ncd nnf-deploy-$NNF_VERSION\ngit -c advice.detachedHead=false checkout $NNF_VERSION --recurse-submodules\n\ngit submodule status\n
"},{"location":"rfcs/","title":"Request for Comment","text":" -
Rabbit Request For Comment Process - Published
-
Rabbit Storage For Containerized Applications - Published
"},{"location":"rfcs/0001/readme/","title":"Rabbit Request For Comment Process","text":"Rabbit software must be designed in close collaboration with our end-users. Part of this process involves open discussion in the form of Request For Comment (RFC) documents. The remainder of this document presents the RFC process for Rabbit.
"},{"location":"rfcs/0001/readme/#history-philosophy","title":"History & Philosophy","text":"NNF RFC documents are modeled after the long history of IETF RFC documents that describe the internet. The philosophy is captured best in RFC 3
The content of a [...] note may be any thought, suggestion, etc. related to the HOST software or other aspect of the network. Notes are encouraged to be timely rather than polished. Philosophical positions without examples or other specifics, specific suggestions or implementation techniques without introductory or background explication, and explicit questions without any attempted answers are all acceptable. The minimum length for a [...] note is one sentence.
These standards (or lack of them) are stated explicitly for two reasons. First, there is a tendency to view a written statement as ipso facto authoritative, and we hope to promote the exchange and discussion of considerably less than authoritative ideas. Second, there is a natural hesitancy to publish something unpolished, and we hope to ease this inhibition.
"},{"location":"rfcs/0001/readme/#when-to-create-an-rfc","title":"When to Create an RFC","text":"New features, improvements, and other tasks that need to source feedback from multiple sources are to be written as Request For Comment (RFC) documents.
"},{"location":"rfcs/0001/readme/#metadata","title":"Metadata","text":"At the start of each RFC, there must include a short metadata block that contains information useful for filtering and sorting existing documents. This markdown is not visible inside the document.
---\nauthors: John Doe <john.doe@company.com>, Jane Doe <jane.doe@company.com>\nstate: prediscussion|ideation|discussion|published|committed|abandoned\ndiscussion: (link to PR, if available)\n----\n
"},{"location":"rfcs/0001/readme/#creation","title":"Creation","text":"An RFC should be created at the next freely available 4-digit index the GitHub RFC folder. Create a folder for your RFC and write your RFC document as readme.md
using standard Markdown. Include additional documents or images in the folder if needed.
Add an entry to /docs/rfcs/index.md
Add an entry to /mkdocs.yml
in the nav[RFCs]
section
"},{"location":"rfcs/0001/readme/#push","title":"Push","text":"Push your changes to your RFC branch
git add --all\ngit commit -s -m \"[####]: Your Request For Comment Document\"\ngit push origin ####\n
"},{"location":"rfcs/0001/readme/#pull-request","title":"Pull Request","text":"Submit a PR for your branch. This will open your RFC to comments. Add those individuals who are interested in your RFC as reviewers.
"},{"location":"rfcs/0001/readme/#merge","title":"Merge","text":"Once consensus has been reached on your RFC, merge to main origin.
"},{"location":"rfcs/0002/readme/","title":"Rabbit storage for containerized applications","text":"Note
This RFC contains outdated information. For the most up-to-date details, please refer to the User Containers documentation.
For Rabbit to provide storage to a containerized application there needs to be some mechanism. The remainder of this RFC proposes that mechanism.
"},{"location":"rfcs/0002/readme/#actors","title":"Actors","text":"There are several actors involved:
- The AUTHOR of the containerized application
- The ADMINISTRATOR who works with the author to determine the application requirements for execution
- The USER who intends to use the application using the 'container' directive in their job specification
- The RABBIT software that interprets the #DWs and starts the container during execution of the job
There are multiple relationships between the actors:
- AUTHOR to ADMINISTRATOR: The author tells the administrator how their application is executed and the NNF storage requirements.
- Between the AUTHOR and USER: The application expects certain storage, and the #DW must meet those expectations.
- ADMINISTRATOR to RABBIT: Admin tells Rabbit how to run the containerized application with the required storage.
- Between USER and RABBIT: User provides the #DW container directive in the job specification. Rabbit validates and interprets the directive.
"},{"location":"rfcs/0002/readme/#proposal","title":"Proposal","text":"The proposal below outlines the high level behavior of running containers in a workflow:
- The AUTHOR writes their application expecting NNF Storage at specific locations. For each storage requirement, they define:
- a unique name for the storage which can be referenced in the 'container' directive
- the required mount path or mount path prefix
- other constraints or storage requirements (e.g. minimum capacity)
- The AUTHOR works with the ADMINISTRATOR to define:
- a unique name for the program to be referred by USER
- the pod template or MPI Job specification for executing their program
- the NNF storage requirements described above.
- The ADMINISTRATOR creates a corresponding NNF Container Profile Kubernetes custom resource with the necessary NNF storage requirements and pod specification as described by the AUTHOR
- The USER who desires to use the application works with the AUTHOR and the related NNF Container Profile to understand the storage requirements
- The USER submits a WLM job with the #DW container directive variables populated
- WLM runs the workflow and drives it through the following stages...
Proposal
: RABBIT validates the #DW container directive by comparing the supplied values to those listed in the NNF Container Profile. If the workflow fails to meet the requirements, the job fails PreRun
: RABBIT software: - duplicates the pod template specification from the Container Profile and patches the necessary Volumes and the config map. The spec is used as the basis for starting the necessary pods and containers
- creates a config map reflecting the storage requirements and any runtime parameters; this is provided to the container at the volume mount named
nnf-config
, if specified
- The containerized application(s) executes. The expected mounts are available per the requirements and celebration occurs. The pods continue to run until:
- a pod completes successfully (any failed pods will be retried)
- the max number of pod retries is hit (indicating failure on all retry attempts)
- Note: retry limit is non-optional per Kubernetes configuration
- If retries are not desired, this number could be set to 0 to disable any retry attempts
PostRun
: RABBIT software: - marks the stage as
Ready
if the pods have all completed successfully. This includes a successful retry after preceding failures - starts a timer for any running pods. Once the timeout is hit, the pods will be killed and the workflow will indicate failure
- leaves all pods around for log inspection
"},{"location":"rfcs/0002/readme/#container-assignment-to-rabbit-nodes","title":"Container Assignment to Rabbit Nodes","text":"During Proposal
, the USER must assign compute nodes for the container workflow. The assigned compute nodes determine which Rabbit nodes run the containers.
"},{"location":"rfcs/0002/readme/#container-definition","title":"Container Definition","text":"Containers can be launched in two ways:
- MPI Jobs
- Non-MPI Jobs
MPI Jobs are launched using mpi-operator
. This uses a launcher/worker model. The launcher pod is responsible for running the mpirun
command that will target the worker pods to run the MPI application. The launcher will run on the first targeted NNF node and the workers will run on each of the targeted NNF nodes.
For Non-MPI jobs, mpi-operator
is not used. This model runs the same application on each of the targeted NNF nodes.
The NNF Container Profile allows a user to pick one of these methods. Each method is defined in similar, but different fashions. Since MPI Jobs use mpi-operator
, the MPIJobSpec
is used to define the container(s). For Non-MPI Jobs a PodSpec
is used to define the container(s).
An example of an MPI Job is below. The data.mpiSpec
field is defined:
kind: NnfContainerProfile\napiVersion: nnf.cray.hpe.com/v1alpha1\ndata:\n mpiSpec:\n mpiReplicaSpecs:\n Launcher:\n template:\n spec:\n containers:\n - command:\n - mpirun\n - dcmp\n - $(DW_JOB_foo_local_storage)/0\n - $(DW_JOB_foo_local_storage)/1\n image: ghcr.io/nearnodeflash/nnf-mfu:latest\n name: example-mpi\n Worker:\n template:\n spec:\n containers:\n - image: ghcr.io/nearnodeflash/nnf-mfu:latest\n name: example-mpi\n slotsPerWorker: 1\n...\n
An example of a Non-MPI Job is below. The data.spec
field is defined:
kind: NnfContainerProfile\napiVersion: nnf.cray.hpe.com/v1alpha1\ndata:\n spec:\n containers:\n - command:\n - /bin/sh\n - -c\n - while true; do date && sleep 5; done\n image: alpine:latest\n name: example-forever\n...\n
In both cases, the spec
is used as a starting point to define the containers. NNF software supplements the specification to add functionality (e.g. mounting #DW storages). In other words, what you see here will not be the final spec for the container that ends up running as part of the container workflow.
"},{"location":"rfcs/0002/readme/#security","title":"Security","text":"The workflow's UID and GID are used to run the container application and for mounting the specified fileystems in the container. Kubernetes allows for a way to define permissions for a container using a Security Context.
mpirun
uses ssh
to communicate with the worker nodes. ssh
requires that UID is assigned to a username. Since the UID/GID are dynamic values from the workflow, work must be done to the container's /etc/passwd
to map the UID/GID to a username. An InitContainer
is used to modify /etc/passwd
and mount it into the container.
"},{"location":"rfcs/0002/readme/#communication-details","title":"Communication Details","text":"The following subsections outline the proposed communication between the Rabbit nodes themselves and the Compute nodes.
"},{"location":"rfcs/0002/readme/#rabbit-to-rabbit-communication","title":"Rabbit-to-Rabbit Communication","text":""},{"location":"rfcs/0002/readme/#non-mpi-jobs","title":"Non-MPI Jobs","text":"Each rabbit node can be reached via <hostname>.<subdomain>
using DNS. The hostname is the Rabbit node name and the workflow name is used for the subdomain.
For example, a workflow name of foo
that targets rabbit-node2
would be rabbit-node2.foo
.
Environment variables are provided to the container and ConfigMap for each rabbit that is targeted by the container workflow:
NNF_CONTAINER_NODES=rabbit-node2 rabbit-node3\nNNF_CONTAINER_SUBDOMAIN=foo\nNNF_CONTAINER_DOMAIN=default.svc.cluster.local\n
kind: ConfigMap\napiVersion: v1\ndata:\n nnfContainerNodes:\n - rabbit-node2\n - rabbit-node3\n nnfContainerSubdomain: foo\n nnfContainerDomain: default.svc.cluster.local\n
DNS can then be used to communicate with other Rabbit containers. The FQDN for the container running on rabbit-node2 is rabbit-node2.foo.default.svc.cluster.local
.
"},{"location":"rfcs/0002/readme/#mpi-jobs","title":"MPI Jobs","text":"For MPI Jobs, these hostnames and subdomains will be slightly different due to the implementation of mpi-operator
. However, the variables will remain the same and provide a consistent way to retrieve the values.
"},{"location":"rfcs/0002/readme/#compute-to-rabbit-communication","title":"Compute-to-Rabbit Communication","text":"For Compute to Rabbit communication, the proposal is to use an open port between the nodes, so the applications could communicate using IP protocol. The port number would be assigned by the Rabbit software and included in the workflow resource's environmental variables after the Setup state (similar to workflow name & namespace). Flux should provide the port number to the compute application via an environmental variable or command line argument. The containerized application would always see the same port number using the hostPort
/containerPort
mapping functionality included in Kubernetes. To clarify, the Rabbit software is picking and managing the ports picked for hostPort
.
This requires a range of ports to be open in the firewall configuration and specified in the rabbit system configuration. The fewer the number of ports available increases the chances of a port reservation conflict that would fail a workflow.
Example port range definition in the SystemConfiguration:
apiVersion: v1\nitems:\n - apiVersion: dws.cray.hpe.com/v1alpha1\n kind: SystemConfiguration\n name: default\n namespace: default\n spec:\n containerHostPortRangeMin: 30000\n containerHostPortRangeMax: 40000\n ...\n
"},{"location":"rfcs/0002/readme/#example","title":"Example","text":"For this example, let's assume I've authored an application called foo
. This application requires Rabbit local GFS2 storage and a persistent Lustre storage volume.
Working with an administrator, my application's storage requirements and pod specification are placed in an NNF Container Profile foo
:
kind: NnfContainerProfile\napiVersion: v1alpha1\nmetadata:\n name: foo\n namespace: default\nspec:\n postRunTimeout: 300\n maxRetries: 6\n storages:\n - name: DW_JOB_foo-local-storage\n optional: false\n - name: DW_PERSISTENT_foo-persistent-storage\n optional: false\n spec:\n containers:\n - name: foo\n image: foo:latest\n command:\n - /foo\n ports:\n - name: compute\n containerPort: 80\n
Say Peter wants to use foo
as part of his job specification. Peter would submit the job with the directives below:
#DW jobdw name=my-gfs2 type=gfs2 capacity=1TB\n\n#DW persistentdw name=some-lustre\n\n#DW container name=my-foo profile=foo \\\n DW_JOB_foo-local-storage=my-gfs2 \\\n DW_PERSISTENT_foo-persistent-storage=some-lustre\n
Since the NNF Container Profile has specified that both storages are not optional (i.e. optional: false
), they must both be present in the #DW directives along with the container
directive. Alternatively, if either was marked as optional (i.e. optional: true
), it would not be required to be present in the #DW directives and therefore would not be mounted into the container.
Peter submits the job to the WLM. WLM guides the job through the workflow states:
- Proposal: Rabbit software verifies the #DW directives. For the container directive
my-foo
with profile foo
, the storage requirements listed in the NNF Container Profile are foo-local-storage
and foo-persistent-storage
. These values are correctly represented by the directive so it is valid. - Setup: Since there is a jobdw,
my-gfs2
, Rabbit software provisions this storage. -
Pre-Run:
-
Rabbit software generates a config map that corresponds to the storage requirements and runtime parameters.
kind: ConfigMap\n apiVersion: v1\n metadata:\n name: my-job-container-my-foo\n data:\n DW_JOB_foo_local_storage: mount-type=indexed-mount\n DW_PERSISTENT_foo_persistent_storage: mount-type=mount-point\n ...\n
-
Rabbit software creates a pod and duplicates the foo
pod spec in the NNF Container Profile and fills in the necessary volumes and config map.
kind: Pod\n apiVersion: v1\n metadata:\n name: my-job-container-my-foo\n template:\n metadata:\n name: foo\n namespace: default\n spec:\n containers:\n # This section unchanged from Container Profile\n - name: foo\n image: foo:latest\n command:\n - /foo\n volumeMounts:\n - name: foo-local-storage\n mountPath: <MOUNT_PATH>\n - name: foo-persistent-storage\n mountPath: <MOUNT_PATH>\n - name: nnf-config\n mountPath: /nnf/config\n ports:\n - name: compute\n hostPort: 9376 # hostport selected by Rabbit software\n containerPort: 80\n\n # volumes added by Rabbit software\n volumes:\n - name: foo-local-storage\n hostPath:\n path: /nnf/job/my-job/my-gfs2\n - name: foo-persistent-storage\n hostPath:\n path: /nnf/persistent/some-lustre\n - name: nnf-config\n configMap:\n name: my-job-container-my-foo\n\n # securityContext added by Rabbit software - values will be inherited from the workflow\n securityContext:\n runAsUser: 1000\n runAsGroup: 2000\n fsGroup: 2000\n
-
Rabbit software starts the pods on Rabbit nodes
- Post-Run
- Rabbit waits for all pods to finish (or until timeout is hit)
- If all pods are successful, Post-Run is marked as
Ready
- If any pod is not successful, Post-Run is not marked as
Ready
"},{"location":"rfcs/0002/readme/#special-note-indexed-mount-type-for-gfs2-file-systems","title":"Special Note: Indexed-Mount Type for GFS2 File Systems","text":"When using a GFS2 file system, each compute is allocated its own Rabbit volume. The Rabbit software mounts a collection of mount paths with a common prefix and an ending indexed value.
Application AUTHORS must be aware that their desired mount-point really contains a collection of directories, one for each compute node. The mount point type can be known by consulting the config map values.
If we continue the example from above, the foo
application expects the foo-local-storage path of /foo/local
to contain several directories
$ ls /foo/local/*\n\nnode-0\nnode-1\nnode-2\n...\nnode-N\n
Node positions are not absolute locations. WLM could, in theory, select 6 physical compute nodes at physical location 1, 2, 3, 5, 8, 13, which would appear as directories /node-0
through /node-5
in the container path.
Symlinks will be added to support the physical compute node names. Assuming a compute node hostname of compute-node-1
from the example above, it would link to node-0
, compute-node-2
would link to node-1
, etc.
Additionally, not all container instances could see the same number of compute nodes in an indexed-mount scenario. If 17 compute nodes are required for the job, WLM may assign 16 nodes to run one Rabbit, and 1 node to another Rabbit.
"}]}
\ No newline at end of file
diff --git a/dev/sitemap.xml.gz b/dev/sitemap.xml.gz
index a900e89..2fdcf96 100644
Binary files a/dev/sitemap.xml.gz and b/dev/sitemap.xml.gz differ