feat: Install Nvidia DOCA on the servers post provisioning
Signed-off-by: Boris Glimcher <[email protected]>
glimchb committed Jan 16, 2024
1 parent 9f00d11 commit b0fcb7a
Showing 31 changed files with 509 additions and 6 deletions.
@@ -13,7 +13,7 @@ This playbook achieves the following tasks:

* Configures a docker registry to pull images from the internet and store them locally

- * Optionally installs OFED and CUDA
+ * Optionally installs OFED, DOCA and CUDA

.. toctree::

@@ -25,6 +25,18 @@ Optional configurations managed by the provision tool
* CUDA requires an additional reboot while being installed. While this is taken care of by Omnia, users are required to wait an additional few minutes when running the provision tool with CUDA installation for the target nodes to come up.


**Installing DOCA**

**Using the provision tool**

* If ``nvidia_doca_path`` is provided in ``input/provision_config.yml`` and NVIDIA DPUs are available on the target nodes, DOCA packages are deployed post provisioning without user intervention.

**Using the Network playbook**

* DOCA can also be installed using `network.yml <../../Roles/Network/index.html>`_ after the servers are provisioned, assuming the provision tool did not already install the DOCA packages.

.. note:: The DOCA package can be downloaded from `here <https://developer.nvidia.com/networking/doca>`_.

**Installing OFED**

**Using the provision tool**
@@ -194,6 +194,10 @@ Fill in all provision-specific parameters in ``input/provision_config.yml``
| ``string`` | |
| Optional | |
+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_path | Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm" |
| ``string`` | |
| Optional | |
+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

.. note::

@@ -54,12 +54,14 @@ Note the compatibility between cluster OS and control plane OS below:

.. [1] Ensure that control planes running RHEL have an active subscription or are configured to access local repositories. The following repositories should be enabled on the control plane: **AppStream**, **Code Ready Builder (CRB)**, **BaseOS**. For RHEL control planes running 8.5 and below, ensure that sshpass is additionally available to install or download to the control plane (from any local repository).
- * To **optionally** set up CUDA and OFED using the provisioning tool, download the required repositories to the control plane from here to deploy on the target nodes:
+ * To **optionally** set up CUDA, DOCA and OFED using the provisioning tool, download the required repositories to the control plane from here to deploy on the target nodes:

1. `For NVIDIA GPUs: <https://developer.nvidia.com/cuda-downloads/>`_: CUDA is a parallel computing platform and application programming interface that allows software to use certain types of graphics processing units for general purpose processing, an approach called general-purpose computing on GPUs.

2. `For Mellanox <https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_: OFED (OpenFabrics Enterprise Distribution) is open-source software for RDMA and kernel bypass applications. OFED can be used in business, research and scientific environments that require highly efficient networks, storage connectivity and parallel computing.

3. `For NVIDIA DPUs: <https://developer.nvidia.com/networking/doca/>`_: DOCA is ...

* Ensure that all connection names under the network manager match their corresponding device names.
To verify network connection names: ::

1 change: 1 addition & 0 deletions docs/source/InstallationGuides/addinganewnode.rst
@@ -8,6 +8,7 @@ While adding a new node to the cluster, users can modify the following:
- The operating system
- CUDA
- OFED
- DOCA

A new node can be added using the following ways:

@@ -6,6 +6,7 @@ In the event that an existing Omnia cluster needs a different OS version or a fr
- The operating system
- CUDA
- OFED
- DOCA

Omnia can re-provision the cluster by running the following command: ::

2 changes: 2 additions & 0 deletions docs/source/Overview/SupportMatrix/omniainstalledsoftware.rst
@@ -126,6 +126,8 @@ Software Installed by Omnia
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MLNX-OFED | BSD License | MLNX_OFED is an NVIDIA tested and packaged version of OFED that supports two interconnect types using the same RDMA (remote DMA) and kernel bypass APIs called OFED verbs – InfiniBand and Ethernet. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NVIDIA DOCA | NVIDIA License | The NVIDIA® DOCA® is the key to unlocking the potential of the NVIDIA® BlueField® networking platform to offload, accelerate, and isolate data center workloads. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ansible pylibssh | LGPL 2.1 | Python bindings to client functionality of libssh specific to Ansible use case. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| perl-DBD-Pg | GNU General Public License v3 | DBD::Pg - PostgreSQL database driver for the DBI module |
12 changes: 11 additions & 1 deletion docs/source/Roles/Network/index.rst
@@ -16,7 +16,9 @@ Some of the network features Omnia offers are:

2. Infiniband switch configuration

- To install OFED drivers, enter all required parameters in ``input/network_config.yml``:
+ 3. Nvidia DOCA
+
+ To install OFED and DOCA drivers, enter all required parameters in ``input/network_config.yml``:


+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -37,6 +39,14 @@ To install OFED drivers, enter all required parameters in ``input/network_config
| | * ``false`` <- Default |
| | * ``true`` |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_offline_path | Absolute path to local copy of rpm file containing DOCA package. The package can be downloaded from https://developer.nvidia.com/networking/doca/. |
| [optional] | |
| ``string`` | |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_version | Indicates the version of DOCA to be downloaded. If ``nvidia_doca_offline_path`` is not given, declaring this variable is mandatory. |
| [optional] | |
| ``string`` | **Default value**: 2.5.0-0.0.1.23.10.1.1.9.0 |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
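As a rough illustration of how the two DOCA parameters in the table above relate, the fragment below sketches two ways ``input/network_config.yml`` might be filled in. The offline file name is borrowed from the ``nvidia_doca_path`` example used for the provision tool and is only an assumption here; the version string is the value shipped in ``input/network_config.yml`` in this commit. Substitute the file and version actually in use.

```yaml
# Option 1 (illustrative): install from a local copy of the DOCA rpm.
# The path is an assumed example, not a required location.
nvidia_doca_offline_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"
nvidia_doca_version: 2.5.0

# Option 2 (illustrative): leave the offline path empty so that the
# version named in nvidia_doca_version is downloaded instead.
# nvidia_doca_offline_path: ""
# nvidia_doca_version: 2.5.0
```

Only one of the two options should be active at a time; the second is commented out so the fragment stays valid YAML.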

To run the script: ::

5 changes: 5 additions & 0 deletions docs/source/Tables/bmc.csv
@@ -283,6 +283,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/mapping.csv
@@ -258,6 +258,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/snmpwalk.csv
@@ -265,6 +265,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/switch-based.csv
@@ -299,6 +299,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
10 changes: 10 additions & 0 deletions input/network_config.yml
@@ -33,3 +33,13 @@ mlnx_ofed_version: 5.4-2.4.1.3
# Mandatory variable
# Default value: true
mlnx_ofed_add_kernel_support: true

# Absolute path to local copy of .rpm file containing DOCA package.
# The package can be downloaded from https://developer.nvidia.com/networking/doca/
# Optional variable.
nvidia_doca_offline_path: ""

# Mandatory if nvidia_doca_offline_path is not given.
# The DOCA package is downloaded according to the version set in this variable.
# Default value: 2.5.0
nvidia_doca_version: 2.5.0
8 changes: 8 additions & 0 deletions input/provision_config.yml
@@ -256,6 +256,14 @@ mlnx_ofed_path: ""
# cuda_toolkit_path: "/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"
cuda_toolkit_path: ""

#### Optional, discovery_mechanism: mapping or switch_based or bmc or snmpwalk
# Absolute path to local copy of .rpm file containing DOCA packages.
# The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca
# DOCA will be installed post provisioning without any user intervention
# Example:
# nvidia_doca_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"
nvidia_doca_path: ""

#### Mandatory, discovery_mechanism: mapping or switch_based or bmc or snmpwalk
# apptainer will be installed on the cluster to enable execution of HPC benchmarks in a containerized environment.
# If apptainer_support: false, apptainer will not be installed on the cluster
10 changes: 10 additions & 0 deletions network/network.yml
@@ -39,6 +39,16 @@
        name: mlnx_ofed
        tasks_from: validations.yml

- name: Validate input parameters for nvidia_doca
  hosts: localhost
  connection: local
  gather_facts: true
  tasks:
    - name: Validate variables from network_config.yml
      ansible.builtin.include_role:
        name: nvidia_doca
        tasks_from: validations.yml
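
The ``validations.yml`` file that this play includes is not shown in this excerpt. The following is a hypothetical sketch, not the commit's actual file, of how the documented rule (a version is mandatory when no offline path is given) could be checked with ``ansible.builtin.assert`` and ``ansible.builtin.stat``. The registered variable ``doca_file_stat`` and the way ``local_installer`` is derived are assumptions made for illustration; only the fact name ``local_installer`` itself appears in the install tasks of this commit.

```yaml
# Hypothetical sketch: the real validations.yml is not part of this excerpt.
- name: Check that an offline path or a version is provided
  ansible.builtin.assert:
    that:
      - "(nvidia_doca_offline_path | default('', true) | length > 0) or (nvidia_doca_version | default('', true) | string | length > 0)"
    fail_msg: "Set nvidia_doca_offline_path or nvidia_doca_version in input/network_config.yml."

- name: Look up the offline DOCA file when a path is given
  ansible.builtin.stat:
    path: "{{ nvidia_doca_offline_path }}"
  register: doca_file_stat   # assumed variable name
  when: nvidia_doca_offline_path | default('', true) | length > 0

- name: Fail if the offline DOCA file is missing
  ansible.builtin.assert:
    that:
      - doca_file_stat.stat.exists
    fail_msg: "File {{ nvidia_doca_offline_path }} was not found."
  when: nvidia_doca_offline_path | default('', true) | length > 0

# Assumption: local_installer is true when an offline path was supplied.
- name: Record which installation mode to use
  ansible.builtin.set_fact:
    local_installer: "{{ (nvidia_doca_offline_path | default('', true) | length > 0) | bool }}"
```

Setting the fact on localhost would let the install tasks read it through ``hostvars['localhost']['local_installer']``, which is how they reference it in this commit.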

- name: Check nodes having Infiniband Support
  hosts: all
  tasks:
18 changes: 18 additions & 0 deletions network/roles/nvidia_doca/tasks/install_doca_leap.yml
@@ -0,0 +1,18 @@
# Copyright 2024 Dell Inc. or its subsidiaries. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---

- name: Report that DOCA installation on Leap is unsupported
  ansible.builtin.debug:
    msg: "DOCA installation on Leap is not yet supported."
53 changes: 53 additions & 0 deletions network/roles/nvidia_doca/tasks/install_doca_redhat.yml
@@ -0,0 +1,53 @@
# Copyright 2024 Dell Inc. or its subsidiaries. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---

- name: Install doca using offline path
  block:
    - name: Install packages from doca rpm file
      ansible.builtin.yum:
        name: "{{ doca_filepath }}"
        state: present
        disable_gpg_check: true

  when: hostvars['localhost']['local_installer']

- name: Install doca using online repository
  block:
    - name: Set Redhat distro
      ansible.builtin.set_fact:
        distro: "{{ 'rhel' + ansible_distribution_major_version }}"

    - name: Add doca repository
      ansible.builtin.command: dnf config-manager --add-repo "{{ doca_repo_url }}"
      changed_when: false

  when: not hostvars['localhost']['local_installer']

- name: Clean cache
  ansible.builtin.command: yum clean all
  changed_when: false

- name: Install DOCA SDK
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  with_items:
    - doca-runtime
    - doca-tools
  notify:
    - Reboot node

- name: Flush handler to reboot the node
  ansible.builtin.meta: flush_handlers
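
The tasks above notify a handler named ``Reboot node``, but the handler file is not shown in this excerpt. A minimal sketch of what such a handler could look like, assuming it lives in the role's ``handlers/main.yml`` and uses ``ansible.builtin.reboot``, is given below. The file location and the timeout value are assumptions, not taken from the commit.

```yaml
# Hypothetical handlers/main.yml for the nvidia_doca role (not shown in this excerpt).
- name: Reboot node
  ansible.builtin.reboot:
    reboot_timeout: 600   # seconds to wait for the node to return; assumed value
```

Because the install tasks end with ``ansible.builtin.meta: flush_handlers``, a handler like this would run immediately after the DOCA packages are installed rather than at the end of the play.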