NCCL UCX plugin
- IB- or RoCE-enabled HCA
- NVIDIA CUDA Toolkit
- NCCL 2.6 (https://github.com/nvidia/nccl)
- OpenUCX 1.8 (https://github.com/openucx/ucx) with CUDA support
- Optional: build UCX with multithread support for use with multithreaded NCCL
- Optional: build UCX with gdrcopy support, which might improve the performance of small collectives
- Optional: GPUDirectRDMA driver (https://github.com/Mellanox/nv_peer_memory)
Please make sure that all requirements are satisfied.
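Before building, it can help to confirm that the prerequisites are visible on the node. The following is a minimal sketch, not part of the plugin. It assumes the usual tool names `nvcc` (CUDA Toolkit), `ucx_info` (UCX) and `ibv_devinfo` (rdma-core), and assumes the GPUDirect RDMA kernel module is named `nv_peer_mem`. The helper names `have_tool` and `report_tool` are hypothetical.

```shell
#!/bin/sh
# Sketch: report which prerequisites are visible on this node.
# Tool and module names are assumptions; adjust them for your installation.

# Succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Prints "<tool>: ok" or "<tool>: missing"; never fails, so every tool is reported.
report_tool() {
    if have_tool "$1"; then
        echo "$1: ok"
    else
        echo "$1: missing"
    fi
}

report_tool nvcc         # CUDA Toolkit compiler
report_tool ucx_info     # UCX diagnostics utility
report_tool ibv_devinfo  # lists IB/RoCE HCAs

# A UCX build with CUDA support is expected to list CUDA transports.
if have_tool ucx_info; then
    if ucx_info -d 2>/dev/null | grep -q cuda; then
        echo "UCX CUDA support: ok"
    else
        echo "UCX CUDA support: not detected"
    fi
fi

# Optional GPUDirect RDMA kernel module (assumed name: nv_peer_mem).
if grep -q '^nv_peer_mem ' /proc/modules 2>/dev/null; then
    echo "nv_peer_mem: loaded"
else
    echo "nv_peer_mem: not loaded"
fi
```

A "missing" line for an optional component is not an error; it only indicates that the corresponding optional feature is unlikely to be available.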
You can build it as follows:
% git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins
% cd nccl-rdma-sharp-plugins
% ./autogen.sh
% ./configure --with-ucx=$UCX_DIR
% make
% make install
To build the plugin with UCX and Mellanox SHARP support:
% git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins
% cd nccl-rdma-sharp-plugins
% ./autogen.sh
% ./configure --with-ucx=$UCX_DIR --with-sharp=$SHARP_DIR
% make
% make install
Please refer to the SHARP deployment guide for further details: https://docs.mellanox.com/display/SHARPv200
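The two configure invocations above differ only in the optional `--with-sharp` argument. As a sketch, a small helper can assemble the argument list from the install paths. The function name `configure_args` and the example paths are hypothetical, and `--prefix` is the generic autoconf install-location option, assumed here to behave as usual for this project.

```shell
#!/bin/sh
# Sketch: assemble ./configure arguments for the plugin build.
# Usage: configure_args <ucx_dir> [sharp_dir] [install_prefix]
# Empty or omitted optional arguments are skipped.
configure_args() {
    args="--with-ucx=$1"
    if [ -n "${2:-}" ]; then
        args="$args --with-sharp=$2"
    fi
    if [ -n "${3:-}" ]; then
        args="$args --prefix=$3"   # generic autoconf option (assumption)
    fi
    echo "$args"
}

# Example with placeholder paths; this only prints the arguments.
configure_args /opt/ucx /opt/sharp /opt/nccl-plugins
```

The output could then be passed to the build, for example `./configure $(configure_args "$UCX_DIR" "$SHARP_DIR")`, provided the paths contain no spaces.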
NCCL automatically picks up the network plugin when it is available in the library search path.
# libnccl_net.so is in <plugin_install_dir>/lib
export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
Additionally, enable the UCX plugin by adding the following to the command line:
NCCL_PLUGIN_P2P=ucx
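If NCCL does not pick up the plugin, one quick check is whether libnccl_net.so is actually reachable through LD_LIBRARY_PATH. The sketch below uses a hypothetical helper, `find_in_ld_path`, that only walks LD_LIBRARY_PATH and ignores the system default library directories.

```shell
#!/bin/sh
# Sketch: print the first LD_LIBRARY_PATH directory that contains a given file.
# Only LD_LIBRARY_PATH is searched; system default paths are not checked.
find_in_ld_path() {
    old_ifs=$IFS
    IFS=:
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -n "$dir" ] && [ -e "$dir/$1" ]; then
            IFS=$old_ifs
            echo "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

if dir=$(find_in_ld_path libnccl_net.so); then
    echo "plugin found in $dir"
else
    echo "libnccl_net.so is not in LD_LIBRARY_PATH"
fi
```

When several directories contain a libnccl_net.so, the first match is reported, which mirrors the left-to-right order of the search path.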
Depending on the server configuration, different UCX parameters are needed to get the best performance. Here are some common use cases:
In such a scenario, GPU Direct RDMA gives the best possible performance. Export the following environment variables before running the NCCL command line:
% export UCX_RNDV_THRESH=0 UCX_RNDV_SCHEME=get_zcopy
% <run command>
For further performance improvements, try the hardware tag matching feature, which can be enabled as follows:
% export UCX_RC_MLX5_TM_ENABLE=y UCX_TM_THRESH=1 UCX_RC_MLX5_RX_QUEUE_LEN=512 UCX_RC_MLX5_TM_SEG_SIZE=2M UCX_RNDV_THRESH=inf
% <run command>
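Note that the tag matching line sets UCX_RNDV_THRESH=inf, which replaces the UCX_RNDV_THRESH=0 value from the GPU Direct RDMA line if both are exported in the same shell. One way to keep the two sets of values apart is a small helper that applies one named set at a time. This is only a sketch: the profile names `gdr` and `tm` and both function names are hypothetical, while the variable values are copied from the commands above.

```shell
#!/bin/sh
# Sketch: print the UCX settings from this section for a named profile.
# "gdr" and "tm" are hypothetical labels, not UCX terminology.
ucx_profile() {
    case "$1" in
        gdr) echo "UCX_RNDV_THRESH=0 UCX_RNDV_SCHEME=get_zcopy" ;;
        tm)  echo "UCX_RC_MLX5_TM_ENABLE=y UCX_TM_THRESH=1 UCX_RC_MLX5_RX_QUEUE_LEN=512 UCX_RC_MLX5_TM_SEG_SIZE=2M UCX_RNDV_THRESH=inf" ;;
        *)   echo "unknown profile: $1" >&2; return 1 ;;
    esac
}

# Export every NAME=value pair of one profile into the current shell.
# Relies on each pair being a single word without spaces.
apply_ucx_profile() {
    settings=$(ucx_profile "$1") || return 1
    for kv in $settings; do
        export "$kv"
    done
}

apply_ucx_profile gdr
echo "UCX_RNDV_THRESH=$UCX_RNDV_THRESH"
```

Applying `tm` after `gdr` leaves UCX_RNDV_SCHEME set and changes only UCX_RNDV_THRESH, so starting from a clean shell makes comparisons between the two easier to interpret.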
By default, NCCL builds as a static library to enable portability. In that case, the user must explicitly disable the memory type cache feature in UCX to prevent wrong memory type detection and program failures. Add the following variable to the command line:
% export UCX_MEMTYPE_CACHE=n
% <run command>
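Whether this setting is needed depends on how the application links NCCL. As a rough heuristic, and only an assumption about a typical Linux setup, `ldd` can show whether a binary depends on a shared libnccl.so. If it does not, NCCL may be linked statically (or not used at all). The function names below are hypothetical.

```shell
#!/bin/sh
# Sketch: guess whether a binary uses a shared libnccl by inspecting ldd output.
# Heuristic only: a binary that never uses NCCL, or a missing file,
# is also reported as "no shared libnccl".
links_shared_nccl() {
    ldd "$1" 2>/dev/null | grep -q 'libnccl\.so'
}

# Print the UCX setting this section suggests for the given binary.
memtype_cache_setting() {
    if links_shared_nccl "$1"; then
        echo "UCX_MEMTYPE_CACHE unchanged"
    else
        echo "UCX_MEMTYPE_CACHE=n"
    fi
}

# /bin/sh stands in for a real application binary in this example.
memtype_cache_setting /bin/sh
```

Setting UCX_MEMTYPE_CACHE=n when in doubt follows the recommendation above; the check only helps decide when the variable might be left alone.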
NCCL Tests (https://github.com/nvidia/nccl-tests) can be used for NCCL-UCX performance benchmarking:
mpirun \
-np 2 \
--bind-to socket \
-x LD_LIBRARY_PATH \
-x UCX_TLS=rc_x,cuda_copy \
-x UCX_RNDV_THRESH=0 \
-x UCX_MEMTYPE_CACHE=n \
-x NCCL_COLLNET_ENABLE=0 \
-x NCCL_PLUGIN_P2P=ucx \
-x NCCL_DEBUG=info \
-x NCCL_DEBUG_SUBSYS=NET \
-x NCCL_IB_HCA=mlx5_0:1 \
$NCCL_TEST_HOME/build/all_reduce_perf -b 128 -e 128M -f 2 -g 1 -n 50 -w 100 -p 0 -z 0 -t 1 -c 1
# nThread 1 nGpus 1 minBytes 128 maxBytes 134217728 step: 2(factor) warmup iters: 100 iters: 50 validation: 1
#
# Using devices
# Rank 0 Pid 7198 on host1 device 0 [0x06] Tesla V100-SXM2-32GB
# Rank 1 Pid 4890 on host2 device 0 [0x06] Tesla V100-SXM2-32GB
host1:7198:7198 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.3<0>
NCCL version 2.6.0a0+cuda10.1
host2:4890:4890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:1.1.21.4<0>
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 00 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Thread mode multi is not supported
host1:7198:7226 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 00 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host2:4890:4920 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host2:4890:4920 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 0
host1:7198:7226 [0] NCCL INFO Ring 01 : 1[6000] -> 0[6000] [receive] via NET/UCX/0/GDRDMA
host2:4890:4920 [0] NCCL INFO Worker address length: 55
host1:7198:7226 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 6000 / HCA 0 (distance 2 <= 3), read 1
host1:7198:7226 [0] NCCL INFO Ring 01 : 0[6000] -> 1[6000] [send] via NET/UCX/0/GDRDMA
host1:7198:7226 [0] NCCL INFO Worker address length: 55
#
#                                                 out-of-place                      in-place
#       size       count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)  (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128          32   float     sum    12.52    0.01    0.01  0e+00    12.98    0.01    0.01  0e+00
         256          64   float     sum    13.27    0.02    0.02  0e+00    12.60    0.02    0.02  0e+00
         512         128   float     sum    13.93    0.04    0.04  0e+00    13.35    0.04    0.04  0e+00
        1024         256   float     sum    14.49    0.07    0.07  0e+00    13.33    0.08    0.08  0e+00
        2048         512   float     sum    16.31    0.13    0.13  0e+00    15.39    0.13    0.13  0e+00
        4096        1024   float     sum    18.68    0.22    0.22  0e+00    18.34    0.22    0.22  0e+00
        8192        2048   float     sum    21.15    0.39    0.39  0e+00    20.24    0.40    0.40  0e+00
       16384        4096   float     sum    25.41    0.64    0.64  0e+00    24.88    0.66    0.66  0e+00
       32768        8192   float     sum    30.45    1.08    1.08  0e+00    29.19    1.12    1.12  0e+00
       65536       16384   float     sum    54.06    1.21    1.21  0e+00    51.16    1.28    1.28  0e+00
      131072       32768   float     sum    57.06    2.30    2.30  0e+00    56.23    2.33    2.33  0e+00
      262144       65536   float     sum    69.92    3.75    3.75  0e+00    69.38    3.78    3.78  0e+00
      524288      131072   float     sum    95.06    5.52    5.52  0e+00    94.17    5.57    5.57  0e+00
     1048576      262144   float     sum    144.1    7.28    7.28  0e+00    142.3    7.37    7.37  0e+00
     2097152      524288   float     sum    234.8    8.93    8.93  0e+00    232.5    9.02    9.02  0e+00
     4194304     1048576   float     sum    417.0   10.06   10.06  0e+00    417.0   10.06   10.06  0e+00
     8388608     2097152   float     sum    801.4   10.47   10.47  0e+00    799.1   10.50   10.50  0e+00
    16777216     4194304   float     sum   1583.1   10.60   10.60  0e+00   1581.8   10.61   10.61  0e+00
    33554432     8388608   float     sum   3141.8   10.68   10.68  0e+00   3142.9   10.68   10.68  0e+00
    67108864    16777216   float     sum   6253.9   10.73   10.73  0e+00   6247.8   10.74   10.74  0e+00
   134217728    33554432   float     sum    12381   10.84   10.84  0e+00    12396   10.83   10.83  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 4.53304
#
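The bandwidth columns can be reproduced from the size and time columns. In nccl-tests, algorithm bandwidth (algbw) is the message size divided by the elapsed time, and for all-reduce the bus bandwidth (busbw) is algbw multiplied by 2(n-1)/n, where n is the number of ranks. With the two ranks used here that factor is 1, which is why algbw and busbw are identical in the table. A small sketch of the arithmetic, with hypothetical function names:

```shell
#!/bin/sh
# Sketch: recompute nccl-tests bandwidth columns from size and time.
# bytes / microseconds gives MB/s, so dividing by 1000 yields GB/s.

# Usage: algbw <bytes> <time_us>
algbw() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.2f\n", b / t / 1000 }'
}

# Usage: allreduce_busbw <bytes> <time_us> <ranks>
# Applies the all-reduce correction factor 2*(n-1)/n to algbw.
allreduce_busbw() {
    awk -v b="$1" -v t="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", b / t / 1000 * 2 * (n - 1) / n }'
}

# Last out-of-place row of the table: 134217728 bytes in 12381 us on 2 ranks.
algbw 134217728 12381
allreduce_busbw 134217728 12381 2
```

Both calls print 10.84, matching the algbw and busbw values in the last out-of-place row.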