Merlin: HugeCTR V3.1 Beta
Release Notes
Larger models and large-scale training are persistent requirements in recommender systems. v3.1 introduces a set of new scalability optimizations, available now in this beta version:
- Distributed hybrid embedding - Model-parallel/data-parallel split of embeddings based on statistical access frequency, minimizing embedding exchange traffic.
- Optimized communication collectives - Hierarchical multi-node all-to-all for NVLink aggregation and a one-shot algorithm for all-reduce.
- Optimized data reader - Async-I/O-based data reader that maximizes I/O utilization and minimizes interference with collectives, plus evaluation-data caching.
- MLP fusions - Fused GEMM + bias + ReLU fprop and GEMM + dReLU + bgrad bprop.
- Compute-communication overlap - Generalized embedding and bottom MLP overlap.
- Holistic CUDA graph - Full iteration graph capture to reduce launch latencies and jitter.
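To make the MLP-fusion item concrete, the sketch below shows the math being fused in each direction: the forward pass combines GEMM, bias add, and ReLU into one step, and the backward pass combines dReLU, the bias gradient (bgrad), and the GEMMs. This is a minimal NumPy illustration of the computation only; the function names are hypothetical, and HugeCTR performs these steps in fused CUDA kernels rather than separate NumPy calls.

```python
import numpy as np

def mlp_layer_fprop(x, W, b):
    # Fused fprop: GEMM + bias + ReLU in one conceptual step.
    z = x @ W + b               # GEMM + bias
    y = np.maximum(z, 0.0)      # ReLU
    return y, z                 # z is kept for the backward pass

def mlp_layer_bprop(dy, z, x, W):
    # Fused bprop: dReLU + bgrad + GEMMs in one conceptual step.
    dz = dy * (z > 0)           # dReLU: gate upstream gradient
    db = dz.sum(axis=0)         # bgrad: bias gradient
    dW = x.T @ dz               # weight-gradient GEMM
    dx = dz @ W.T               # data-gradient GEMM
    return dx, dW, db

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4, 8 input features
W = rng.standard_normal((8, 16))   # layer weights
b = rng.standard_normal(16)        # layer bias

y, z = mlp_layer_fprop(x, W, b)
dx, dW, db = mlp_layer_bprop(np.ones_like(y), z, x, W)
```

Fusing these steps avoids writing the intermediate activations out to global memory between kernel launches, which is where the speedup comes from.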