NCCL UCX plugin

Requirements

IB- or RoCE-enabled HCA
NVIDIA CUDA Toolkit
Mellanox OFED 5.0-1.0.0.0
- Available at https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
NCCL 2.6 (https://github.com/nvidia/nccl)
HPC-X 2.6
- Available at https://www.mellanox.com/products/hpc-x-toolkit
OpenUCX 1.8 (https://github.com/openucx/ucx) with CUDA support
- Optional: build UCX with multithread support for use with multithreaded NCCL
- Optional: build UCX with gdrcopy support, which might improve the performance of small collectives
Optional: GPUDirectRDMA driver (https://github.com/Mellanox/nv_peer_memory)
- https://www.mellanox.com/products/GPUDirect-RDMA

Build

Please make sure that all requirements are satisfied.

Download and load HPC-X for your OS and MOFED versions

OS_DISTRO=ubuntu18.04-x86_64
MLNX_OFED=MLNX_OFED_LINUX-5.0-1.0.0.0
wget http://content.mellanox.com/hpc/hpc-x/v2.6/hpcx-v2.6.0-gcc-${MLNX_OFED}-${OS_DISTRO}.tbz -O hpcx.tbz
tar xjf hpcx.tbz
module use hpcx-v2.6.0-gcc-${MLNX_OFED}-${OS_DISTRO}/modulefiles
module load hpcx
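
The directory name passed to `module use` above is assembled from the same two variables as the tarball name, so a mismatch between them is easy to miss. The following is a minimal sketch of a sanity check. The helper names `hpcx_dir_name` and `hpcx_has_modulefiles` are hypothetical, and the sketch assumes the naming pattern shown in the wget command for HPC-X 2.6.0.

```shell
# Hypothetical helper: compose the HPC-X directory name used above.
# Assumes the hpcx-v2.6.0-gcc-<MLNX_OFED>-<OS_DISTRO> pattern from the wget line.
hpcx_dir_name() {
    local mofed="$1" distro="$2"
    printf 'hpcx-v2.6.0-gcc-%s-%s\n' "$mofed" "$distro"
}

# Hypothetical helper: succeed only if the extracted tree
# contains the modulefiles directory that "module use" expects.
hpcx_has_modulefiles() {
    [ -d "$1/modulefiles" ]
}
```

A possible use is `hpcx_has_modulefiles "$(hpcx_dir_name "$MLNX_OFED" "$OS_DISTRO")" || echo "HPC-X tree not found" >&2` before calling `module use`.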

Build NCCL-UCX plugin

Without Mellanox SHARP support

% git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins
% cd nccl-rdma-sharp-plugins
% ./autogen.sh
% ./configure --with-ucx=$UCX_DIR
% make
% make install

With Mellanox SHARP support

% git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins
% cd nccl-rdma-sharp-plugins
% ./autogen.sh
% ./configure --with-ucx=$UCX_DIR --with-sharp=$HPCX_SHARP_DIR
% make
% make install

Please refer to the SHARP deployment guide for further details: https://docs.mellanox.com/display/SHARPv200
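
Both configure invocations above rely on $UCX_DIR pointing at a UCX installation. The following is a minimal sketch of a pre-flight check that can be run before ./configure. The function name `ucx_prefix_ok` is hypothetical, and the check assumes the usual UCX install layout in which the UCP header lives under include/ucp/api/.

```shell
# Hypothetical pre-flight check: does the given prefix look like a UCX install?
# Assumption: a UCX install provides include/ucp/api/ucp.h under its prefix.
ucx_prefix_ok() {
    local prefix="$1"
    [ -n "$prefix" ] && [ -f "$prefix/include/ucp/api/ucp.h" ]
}
```

For example, `ucx_prefix_ok "$UCX_DIR" || echo "UCX_DIR does not look like a UCX install" >&2` reports an unset or wrong path before the build starts.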

Run

NCCL automatically picks up the network plugin when it is available in the library search path.

# libnccl_net.so is in <plugin_install_dir>/lib
% export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
% <run command>

Additionally, enable the UCX plugin by adding the following to the command line:

% export NCCL_PLUGIN_P2P=ucx
% <run command>
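
Since the plugin is only picked up when libnccl_net.so is found in the library search path, it can help to confirm that before launching a job. The following is a minimal sketch. The function name `find_nccl_net_plugin` is hypothetical, and it only inspects the colon-separated path it is given (for example $LD_LIBRARY_PATH), not other locations the dynamic loader may search.

```shell
# Hypothetical helper: print the first directory in a colon-separated search
# path that contains libnccl_net.so; return non-zero if none does.
find_nccl_net_plugin() {
    local search_path="$1" dir
    local IFS=':'
    for dir in $search_path; do
        # Skip empty entries such as a leading or trailing colon.
        [ -n "$dir" ] || continue
        if [ -e "$dir/libnccl_net.so" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

Running `find_nccl_net_plugin "$LD_LIBRARY_PATH"` should print <plugin_install_dir>/lib if the export above took effect.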

Performance tuning considerations

Depending on the server configuration, different UCX parameters are needed to get the best performance. Here are some common use cases:

GPU and HCA share common PCIe switch:

In this scenario, GPU Direct RDMA gives the best possible performance. Export the following environment variables before running the NCCL command line:

% export UCX_RNDV_THRESH=0 UCX_RNDV_SCHEME=get_zcopy
% <run command>

For DGX-like setups with many HCAs:

% export UCX_TLS=dc,cuda_copy,cuda_ipc
% <run command>

For further performance improvements, consider the hardware tag matching feature, which can be enabled as follows:

% export UCX_RC_MLX5_TM_ENABLE=y UCX_TM_THRESH=1 UCX_RC_MLX5_RX_QUEUE_LEN=512 UCX_RC_MLX5_TM_SEG_SIZE=2M UCX_RNDV_THRESH=inf
% <run command>
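
The three settings above can be collected in one place so that a launch script selects them by name. The following is a minimal sketch. The function `ucx_tuning_profile` and the profile names `pcie_switch`, `multi_hca`, and `hw_tag_matching` are hypothetical labels for the cases described in this section; the variable values are copied from the commands above.

```shell
# Hypothetical helper: print the UCX settings from this section for a named case.
# The profile names are illustrative labels, not UCX or NCCL terminology.
ucx_tuning_profile() {
    case "$1" in
        pcie_switch)
            # GPU and HCA share a common PCIe switch.
            echo "UCX_RNDV_THRESH=0 UCX_RNDV_SCHEME=get_zcopy" ;;
        multi_hca)
            # DGX-like setup with many HCAs.
            echo "UCX_TLS=dc,cuda_copy,cuda_ipc" ;;
        hw_tag_matching)
            # Hardware tag matching.
            echo "UCX_RC_MLX5_TM_ENABLE=y UCX_TM_THRESH=1 UCX_RC_MLX5_RX_QUEUE_LEN=512 UCX_RC_MLX5_TM_SEG_SIZE=2M UCX_RNDV_THRESH=inf" ;;
        *)
            echo "unknown profile: $1" >&2
            return 1 ;;
    esac
}
```

Because none of the values contain spaces, the output can be applied with `export $(ucx_tuning_profile pcie_switch)` before the run command.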

Known issues

By default, NCCL is built as a static library to enable portability. In that case, the user must explicitly disable the memory type cache feature in UCX; otherwise UCX may detect memory types incorrectly and the program may fail. Add the following variable to the command line:

% export UCX_MEMTYPE_CACHE=n
% <run command>

NCCL Tests benchmark example

NCCL Tests (https://github.com/nvidia/nccl-tests) can be used to benchmark NCCL-UCX performance:

mpirun \
    -np 2 \
    --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x UCX_TLS=rc_x,cuda_copy \
    -x UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n \
    -x NCCL_COLLNET_ENABLE=0 \
    -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_DEBUG=info \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x NCCL_IB_HCA=mlx5_0:1 \
    $NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1

# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
#   Rank  0 Pid   7198 on  host1 device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4890 on  host2 device  0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128            32   float     sum    12.52    0.01    0.01  0e+00    12.98    0.01    0.01  0e+00
         256            64   float     sum    13.27    0.02    0.02  0e+00    12.60    0.02    0.02  0e+00
         512           128   float     sum    13.93    0.04    0.04  0e+00    13.35    0.04    0.04  0e+00
        1024           256   float     sum    14.49    0.07    0.07  0e+00    13.33    0.08    0.08  0e+00
        2048           512   float     sum    16.31    0.13    0.13  0e+00    15.39    0.13    0.13  0e+00
        4096          1024   float     sum    18.68    0.22    0.22  0e+00    18.34    0.22    0.22  0e+00
        8192          2048   float     sum    21.15    0.39    0.39  0e+00    20.24    0.40    0.40  0e+00
       16384          4096   float     sum    25.41    0.64    0.64  0e+00    24.88    0.66    0.66  0e+00
       32768          8192   float     sum    30.45    1.08    1.08  0e+00    29.19    1.12    1.12  0e+00
       65536         16384   float     sum    54.06    1.21    1.21  0e+00    51.16    1.28    1.28  0e+00
      131072         32768   float     sum    57.06    2.30    2.30  0e+00    56.23    2.33    2.33  0e+00
      262144         65536   float     sum    69.92    3.75    3.75  0e+00    69.38    3.78    3.78  0e+00
      524288        131072   float     sum    95.06    5.52    5.52  0e+00    94.17    5.57    5.57  0e+00
     1048576        262144   float     sum    144.1    7.28    7.28  0e+00    142.3    7.37    7.37  0e+00
     2097152        524288   float     sum    234.8    8.93    8.93  0e+00    232.5    9.02    9.02  0e+00
     4194304       1048576   float     sum    417.0   10.06   10.06  0e+00    417.0   10.06   10.06  0e+00
     8388608       2097152   float     sum    801.4   10.47   10.47  0e+00    799.1   10.50   10.50  0e+00
    16777216       4194304   float     sum   1583.1   10.60   10.60  0e+00   1581.8   10.61   10.61  0e+00
    33554432       8388608   float     sum   3141.8   10.68   10.68  0e+00   3142.9   10.68   10.68  0e+00
    67108864      16777216   float     sum   6253.9   10.73   10.73  0e+00   6247.8   10.74   10.74  0e+00
   134217728      33554432   float     sum    12381   10.84   10.84  0e+00    12396   10.83   10.83  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.53304
#
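
When the output above is saved to a file, a few checks can be scripted: whether the rings actually went through the UCX plugin, what the reported average bus bandwidth was, and how an algbw value follows from a row's size and time. The following is a minimal sketch. The function names are hypothetical, and the parsing assumes the log format shown above.

```shell
# Hypothetical helper: succeed if the log shows at least one ring using NET/UCX.
log_uses_ucx() {
    grep -q 'via NET/UCX' "$1"
}

# Hypothetical helper: print the value from the "# Avg bus bandwidth" line.
avg_busbw() {
    awk -F': *' '/^# Avg bus bandwidth/ { print $2 }' "$1"
}

# Hypothetical helper: algorithm bandwidth in GB/s from bytes and microseconds.
# bytes / us is bytes per microsecond; dividing by 1000 converts to GB/s.
algbw_gbs() {
    awk -v bytes="$1" -v us="$2" 'BEGIN { printf "%.2f\n", bytes / us / 1000 }'
}
```

For the last row of the table, `algbw_gbs 134217728 12381` prints 10.84, which matches the algbw column. With two ranks, the all_reduce bus bandwidth reported by NCCL Tests equals the algorithm bandwidth, which is why the algbw and busbw columns are identical here.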
