Releases: NVIDIA-Merlin/HugeCTR
Merlin: HugeCTR V3.1
What's New in Version 3.1
- MLPerf v1.0 Integration: We've integrated MLPerf optimizations for DLRM training and enabled them as configurable options in the Python interface. Specifically, we have incorporated the AsyncRaw data reader, HybridEmbedding, FusedReluBiasFullyConnectedLayer, the overlapped pipeline, holistic CUDA Graph, and more. The performance of 14-node DGX-A100 DLRM training with the Python APIs is comparable to that of the CLI usage. For more information, see HugeCTR Python Interface and DLRM Sample.
- Enhancements to the Python Interface: We've enhanced the Python interface for HugeCTR so that you no longer have to manually create a JSON configuration file. Our Python APIs can now be used to create the computation graph. They can also be used to dump the model graph as a JSON object and to save the model weights as binary files so that continuous training and inference can take place. We've added an Inference API that takes Norm or Parquet datasets as input to facilitate the inference process. For more information, see HugeCTR Python Interface and the HugeCTR Criteo Notebook; a brief sketch of this workflow also appears after this list.
- New Interface for Unified Embedding: We're introducing a new interface to simplify the use of embeddings and data readers. To help you specify the number of keys in each slot, we added `nnz_per_slot` and `is_fixed_length`. You can now directly configure how much memory usage you need by specifying `workspace_size_per_gpu_in_mb` instead of `max_vocabulary_size_per_gpu`. For convenience, `mean`/`sum` is now used for the combiner instead of 0 and 1. In cases where you don't know which embedding type you should use, you can specify `use_hash_table` and let HugeCTR automatically select the embedding type based on your configuration. For more information, see HugeCTR Python Interface; an embedding sketch also appears after this list.
- Multi-Node Support for Embedding Training Cache (MOS): We've enabled multi-node support for the embedding training cache. You can now train a model with a terabyte-size embedding table using one node or multiple nodes even if the entire embedding table can't fit into the GPU memory. We're also introducing the host memory (HMEM) based parameter server (PS) along with its SSD-based counterpart. If the sparse model can fit into the host memory of each training node, the optimized HMEM-based PS can provide better model loading and dumping performance with a more effective bandwidth. For more information, see HugeCTR Python Interface.
- Enhancements to the Multi-Node TensorFlow Plugin: The Multi-Node TensorFlow Plugin now supports multi-node synchronized training via `tf.distribute.MultiWorkerMirroredStrategy`. With minimal code changes, you can now easily scale your single-GPU training to multi-node, multi-GPU training. The Multi-Node TensorFlow Plugin also supports multi-node synchronized training via Horovod. The inputs for embedding plugins are now data parallel, so the data reader no longer needs to preprocess data for different GPUs based on concrete embedding algorithms.
- NCF Model Support: We've added support for the NCF model, as well as the GMF and NeuMF variant models. With this enhancement, we're introducing a new element-wise multiplication layer and a HitRate evaluation metric. Sample code was added that demonstrates how to preprocess user-item interaction data and train an NCF model with it. New examples have also been added that demonstrate how to train NCF models using MovieLens datasets.
- DIN and DIEN Model Support: All of our layers support the DIN model. The following layers support the DIEN model: FusedReshapeConcat, FusedReshapeConcatGeneral, Gather, GRU, PReLUDice, ReduceMean, Scale, Softmax, and Sub. We also added sample code to demonstrate how to use the Amazon dataset to train the DIN model.
- Multi-Hot Support for Parquet Datasets: We've added multi-hot support for Parquet datasets, so you can now train models with a Parquet dataset that contains both one-hot and multi-hot slots.
- Mixed Precision (FP16) Support in More Layers: The MultiCross layer now supports mixed precision (FP16). With this addition, all layers now support FP16.
- Mixed Precision (FP16) Support in Inference: We've added FP16 support for the inference pipeline. Therefore, dense layers can now adopt FP16 during inference.
- Optimizer State Enhancements for Continuous Training: You can now store optimizer states that are updated during continuous training as files, such as the Adam optimizer's first moment (m) and second moment (v). By default, the optimizer states are initialized with zeros, but you can specify a set of optimizer state files to recover their previous values. For more information about `dense_opt_states_file` and `sparse_opt_states_file`, see Python Interface.
- New Library File for GPU Embedding Cache Data: We've moved the header/source code of the GPU embedding cache data structure into a stand-alone folder. It has been compiled into a stand-alone library file. Just as with HugeCTR, your application programs can now be directly linked against this new library file for future use. For more information, see our GPU Embedding Cache ReadMe.
- Embedding Plugin Enhancements: We've moved all the embedding plugin files into a stand-alone folder. The embedding plugin can be used as a stand-alone Python module, and it works with TensorFlow to accelerate the embedding training process.
- Adagrad Support: Adagrad can now be used to optimize your embedding and network. To use it, change the optimizer type in the Optimizer layer and set the corresponding parameters, as sketched after this list.
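As a rough illustration of the "Enhancements to the Python Interface" item above, here is a minimal sketch of the JSON-free workflow. The class and method names (`CreateSolver`, `DataReaderParams`, `CreateOptimizer`, `Model`, `graph_to_json`) and their argument names reflect our understanding of the v3.1 Python API and should be treated as assumptions; refer to HugeCTR Python Interface for the authoritative signatures.

```python
import hugectr

# Solver, data reader, and optimizer objects replace the old JSON configuration
# file; argument names below are assumptions based on the v3.1 Python interface.
solver = hugectr.CreateSolver(batchsize=16384,
                              batchsize_eval=16384,
                              lr=0.001,
                              vvgpu=[[0]],
                              use_mixed_precision=False)
reader = hugectr.DataReaderParams(data_reader_type=hugectr.DataReaderType_t.Norm,
                                  source=["./train/file_list.txt"],
                                  eval_source="./val/file_list.txt",
                                  check_type=hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam)

# The computation graph is built by adding Input, SparseEmbedding, and
# DenseLayer objects to the Model (layers omitted here for brevity).
model = hugectr.Model(solver, reader, optimizer)
# model.add(hugectr.Input(...))
# model.add(hugectr.SparseEmbedding(...))
# model.add(hugectr.DenseLayer(...))
model.compile()
model.fit(max_iter=2000, eval_interval=500, snapshot=1000, snapshot_prefix="dcn")

# Dump the model graph as JSON; the snapshots written by fit() provide the
# binary weight files used for continuous training and inference.
model.graph_to_json(graph_config_file="dcn.json")
```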
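Continuing the sketch above (and reusing its `model` and `optimizer` objects), the following hedged example shows where the options from the "New Interface for Unified Embedding" item fit: `nnz_per_slot` and `is_fixed_length` on the data-reader side, and `workspace_size_per_gpu_in_mb` plus a named `combiner` on the embedding side. The positional argument order of `DataReaderSparseParam` and the remaining keyword names are assumptions.

```python
# Sparse input: 26 slots, at most one key per slot (nnz_per_slot=1), and
# fixed-length slots (is_fixed_length=True). Assumed argument order:
# DataReaderSparseParam(name, nnz_per_slot, is_fixed_length, slot_num).
sparse_param = hugectr.DataReaderSparseParam("data1", 1, True, 26)

model.add(hugectr.Input(label_dim=1, label_name="label",
                        dense_dim=13, dense_name="dense",
                        data_reader_sparse_param_array=[sparse_param]))

# Size the embedding by a memory budget (workspace_size_per_gpu_in_mb) instead
# of max_vocabulary_size_per_gpu, and choose the combiner by name ("sum" or
# "mean") instead of 0/1.
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=75,
    embedding_vec_size=16,
    combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=optimizer))
```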
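For the Adagrad support noted in the last item, the optimizer type is chosen when creating the optimizer object. A minimal sketch, assuming the enum value is spelled `hugectr.Optimizer_t.AdaGrad` and that the hyperparameter names shown here exist in the Python interface (both are assumptions):

```python
import hugectr

# Use AdaGrad for both the embedding and the dense network; initial_accu_value
# and epsilon are assumed parameter names (check the Python interface docs).
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.AdaGrad,
                                    initial_accu_value=0.0,
                                    epsilon=1e-7,
                                    update_type=hugectr.Update_t.Global)
```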
Merlin: HugeCTR V3.1 Beta
Release Notes
Bigger models and larger-scale training are constant requirements in recommender systems. In v3.1, we provide a set of new optimizations for good scalability, listed below; they are now available in this beta version.
- Distributed Hybrid embedding - Model/data parallel split of embeddings based on statistical access frequency to minimize embedding exchange traffic.
- Optimized communication collectives - Hierarchical multi-node all-to-all for NVLink aggregation and a one-shot algorithm for all-reduce.
- Optimized data reader - Async I/O based data reader to maximize I/O utilization, minimize interference with collectives and eval caching.
- MLP fusions - Fused GEMM + Relu + Bias fprop and GEMM + dRelu + bgrad bprop.
- Compute-communication overlap - Generalized embedding and bottom MLP overlap.
- Holistic CUDA graph - Full iteration graph capture to reduce launch latencies and jitter.
Merlin: HugeCTR V3.0.1
What's New in Version 3.0.1
- DLRM Inference Benchmark: We've added two detailed Jupyter notebooks to illustrate how to train and deploy a DLRM model with HugeCTR while benchmarking its performance. The inference notebook demonstrates how to create Triton and HugeCTR backend configs, prepare the inference data, and deploy the model trained in the other notebook on the Triton Inference Server. It also shows how to benchmark its performance (throughput and latency) with the Triton Performance Analyzer. For more details, check out our HugeCTR inference repository.
- FP16-Specific Optimization in More Dense Layers: We've optimized the DotProduct, ELU, and Sigmoid layers based on `__half2` vectorized loads and stores so that they better utilize device memory bandwidth. Most layers have now been optimized in this way, except the MultiCross, FmOrder2, ReduceSum, and Multiply layers.
- More Finely Tunable Synthetic Data Generator: Our new data generator can generate uniformly distributed datasets in addition to power-law-based datasets. Instead of specifying `vocabulary_size` in total and `max_nnz`, you can now specify such information per categorical feature. See our user guide to learn about its changed usage.
- Decreased Memory Demands of Trained Model Exportation: To prevent out-of-memory errors when saving a trained model that includes a very large embedding table, the amount of memory allocated by the related functions was effectively reduced.
- CUDA Graph Compatible Dropout Layer: The HugeCTR Dropout layer now uses cuDNN by default so that it can be used together with CUDA Graph. In the previous version, CUDA Graph was implicitly turned off if Dropout was used.
Merlin: HugeCTR V3.0
Release Notes
What’s New in Version 3.0
- Inference Support: To streamline the recommender system workflow, we've implemented a custom HugeCTR backend on the NVIDIA Triton Inference Server. The HugeCTR backend leverages the embedding cache and parameter server to efficiently manage embeddings of different sizes and models in a hierarchical manner. For additional information, see our inference repository.
- New High-Level API: You can now also construct and train your models using the Python interface with our new high-level API. See our preview example code to understand how it works.
- FP16 Support in More Layers: All the layers except `MultiCross` support mixed precision mode. We've also optimized some of the FP16 layer implementations based on vectorized loads and stores.
- Enhanced TensorFlow Embedding Plugin: Our embedding plugin now supports `LocalizedSlotSparseEmbeddingHash` mode. With this enhancement, the DNN model no longer needs to be split into two parts since it now connects with the embedding op through `MirroredStrategy` within the embedding layer.
- Extended Model Oversubscription: We've extended the model oversubscription feature to support `LocalizedSlotSparseEmbeddingHash` and `LocalizedSlotSparseEmbeddingHashOneHot`.
- Epoch-Based Training Enhancement: The `num_epochs` option in the Solver clause can now be used with the `Raw` dataset format.
- Deprecation of the `eval_batches` Parameter: The `eval_batches` parameter has been deprecated and replaced with the `max_eval_batches` and `max_eval_samples` parameters. In epoch mode, these parameters control the maximum number of evaluations. An error message will appear when attempting to use the `eval_batches` parameter.
- `MultiplyLayer` Renamed: To clarify what the `MultiplyLayer` does, it was renamed to `WeightMultiplyLayer`.
- Optimized Initialization Time: HugeCTR's initialization time, which includes the GEMM algorithm search and parameter initialization, was significantly reduced.
- Sample Enhancements: Our samples now rely upon the Criteo 1TB Click Logs dataset instead of the Kaggle Display Advertising Challenge dataset. Our preprocessing scripts (Perl, Pandas, and NVTabular) have also been unified and simplified.
- Configurable DataReader Worker: You can now specify the number of data reader workers, which run in parallel, with the `num_workers` parameter. Its default value is 12. However, if you are using the Parquet data reader, you can't configure the `num_workers` parameter since it always corresponds to the number of active GPUs.
Known Issues
- Since the automatic plan file generator isn't able to handle systems that contain one GPU, you must manually create a JSON plan file with the following parameters and rename it using the name listed in the HugeCTR configuration file: `{"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]}`. A sketch that writes this file appears after this list.
- If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: `RuntimeWarning: divide by zero encountered in true_divide`. This is an erroneous warning message and should be ignored.
- The current plan file generator doesn't support a system where the NVSwitch or a full peer-to-peer connection between all nodes is unavailable.
- Users need to set the `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable (`export CUDA_DEVICE_ORDER=PCI_BUS_ID`) to ensure that the CUDA runtime and driver have a consistent GPU numbering.
- `LocalizedSlotSparseEmbeddingOneHot` only supports a single-node machine where all the GPUs are fully connected, such as with NVSwitch.
- HugeCTR version 3.0 crashes when running the DLRM sample on DGX-2 due to a CUDA Graph issue. To run the sample on DGX-2, disable CUDA Graph by setting the `cuda_graph` parameter to false, even if it degrades the performance a bit. This issue doesn't exist when using the DGX A100.
- The HugeCTR embedding TensorFlow plugin only works with single-node machines.
- The HugeCTR embedding TensorFlow plugin assumes that the input keys are in `int64` and its output is in `float`.
- If the number of samples in a dataset is not divisible by the batch size when in epoch mode and using the `num_epochs` parameter instead of `max_iter`, a few remaining samples are truncated. If the training dataset is large enough, its impact can be negligible. If you want to minimize the wasted batches, try adjusting the number of data reader workers. For example, with a file list source, set the `num_workers` parameter to a divisor of the number of data files in the file list.
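For convenience, the single-GPU plan file from the first known issue above can be produced with a few lines of Python. The plan contents are copied verbatim from that issue; the output file name is only a placeholder and must be changed to the plan file name referenced in your HugeCTR configuration file.

```python
import json

# Plan parameters copied from the known issue above (single-GPU all2all plan).
plan = {
    "type": "all2all",
    "num_gpus": 1,
    "main_gpu": 0,
    "num_steps": 1,
    "num_chunks": 1,
    "plan": [[0, 0]],
    "chunks": [1],
}

# "all2all_plan.json" is a placeholder name; rename it to the name listed in
# your HugeCTR configuration file.
with open("all2all_plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```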
Merlin: HugeCTR V2.3
Release Notes
What's New in Version 2.3
We’ve implemented the following enhancements to improve usability and performance:
- Python Interface: To enhance the interoperability with NVTabular and other Python-based libraries, we're introducing a new Python interface for HugeCTR. If you are already using HugeCTR with JSON, the transition to Python will be seamless: you only have to locate the `hugectr.so` file and set the `PYTHONPATH` environment variable. You can still configure your model in your JSON config file, but training options such as `batch_size` must be specified through `hugectr.solver_parser_helper()` in Python. For additional information about how to use the HugeCTR Python API and its signature, see our Jupyter Notebook tutorial and the sketch after this list.
- HugeCTR Embedding with TensorFlow: To help users easily integrate HugeCTR's optimized embedding into their TensorFlow workflow, we now offer the HugeCTR embedding layer as a TensorFlow plugin. To better understand how to install, use, and verify it, see our Jupyter notebook tutorial. It also demonstrates how you can create a new Keras layer, `EmbeddingLayer`, based on the `hugectr_tf_ops.py` helper code that we provide.
- Model Oversubscription: To enable a model with large embedding tables that exceed the single GPU's memory limit, we added a new model prefetching feature, giving you the ability to load a subset of an embedding table into the GPU in a coarse-grained, on-demand manner during the training stage. To use this feature, you need to split your dataset into multiple sub-datasets while extracting the unique key sets from them. This feature can currently only be used with a `Norm` dataset format and its corresponding file list. It will eventually support all embedding types and dataset formats. We revised our `criteo2hugectr` tool to support the key set extraction for the Criteo dataset. For additional information, see our Python Jupyter Notebook to learn how to use this feature with the Criteo dataset. Please note that the Criteo dataset is a common use case, but model prefetching is not limited to this dataset.
- Enhanced AUC Implementation: To enhance the performance of our AUC computation in multi-node environments, we redesigned our AUC implementation to improve how the computational load gets distributed across nodes.
- Epoch-Based Training: In addition to `max_iter`, a HugeCTR user can set `num_epochs` in the Solver clause of their JSON config file. This mode can currently only be used with `Norm` dataset formats and their corresponding file lists. All dataset formats will be supported in the future.
- Multi-Node Training Tutorial: To better support multi-node training use cases, we added a new step-by-step tutorial.
- Power Law Distribution Support with Data Generator: Because of the increased need to generate random datasets whose categorical features follow a power-law distribution, we revised our data generation tool to support this use case. For additional information, refer to the `--long-tail` description here.
- Multi-GPU Preprocessing Script for Criteo Samples: Multiple GPUs can now be used when preparing the dataset for our samples. For additional information, see how preprocess_nvt.py is used to preprocess the Criteo dataset for the DCN, DeepFM, and W&D samples.
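As a rough illustration of the Python Interface item above: a minimal sketch, assuming `hugectr.so` is made importable via `PYTHONPATH` (or `sys.path`), and assuming the `solver_parser_helper()` keyword arguments and the `Session`-based training loop shown here. Apart from `solver_parser_helper()` itself, these names are assumptions; consult the Jupyter Notebook tutorial for the authoritative usage.

```python
import sys

# Point Python at the directory containing hugectr.so (path is a placeholder);
# alternatively, set the PYTHONPATH environment variable before launching.
sys.path.append("/path/to/hugectr/lib")
import hugectr

# Training options such as batch_size are given in Python through
# solver_parser_helper(); the model architecture still comes from the JSON
# config file. Keyword-argument names here are assumptions.
solver_config = hugectr.solver_parser_helper(batchsize=16384,
                                             vvgpu=[[0]],
                                             use_mixed_precision=False)

# The Session class name and its methods are likewise assumptions based on the
# v2.3 tutorial.
sess = hugectr.Session(solver_config, "./dcn_config.json")
sess.start_data_reading()
for _ in range(10000):
    sess.train()
```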
Known Issues
- Since the automatic plan file generator is not able to handle systems that contain one GPU, a user must manually create a JSON plan file with the following parameters and rename it using the name listed in the HugeCTR configuration file: `{"type": "all2all", "num_gpus": 1, "main_gpu": 0, "num_steps": 1, "num_chunks": 1, "plan": [[0, 0]], "chunks": [1]}`.
- If using a system that contains two GPUs with two NVLink connections, the auto plan file generator will print the following warning message: `RuntimeWarning: divide by zero encountered in true_divide`. This is an erroneous warning message and should be ignored.
- The current plan file generator doesn't support a system where the NVSwitch or a full peer-to-peer connection between all nodes is unavailable.
- Users need to set the `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable (`export CUDA_DEVICE_ORDER=PCI_BUS_ID`) to ensure that the CUDA runtime and driver have a consistent GPU numbering.
- `LocalizedSlotSparseEmbeddingOneHot` only supports a single-node machine where all the GPUs are fully connected, such as with NVSwitch.
- HugeCTR version 2.2.1 crashes when running our DLRM sample on DGX-2 due to a CUDA Graph issue. To run the sample on DGX-2, disable the use of CUDA Graph with `"cuda_graph": false` even if it degrades the performance a bit. We are working on fixing this issue. This issue doesn't exist when using the DGX A100.
- The model prefetching feature is only available in Python. Currently, a user can only use this feature with the `DistributedSlotSparseEmbeddingHash` embedding and the `Norm` dataset format on single GPUs. This feature will eventually support all embedding types and dataset formats.
- The HugeCTR embedding TensorFlow plugin only works with single-node machines.
- The HugeCTR embedding TensorFlow plugin assumes that the input keys are in `int64` and its output is in `float`.
- When using our embedding plugin, please note that the `fprop_v3` function, which is available in `tools/embedding_plugin/python/hugectr_tf_ops.py`, only works with `DistributedSlotSparseEmbeddingHash`.
Merlin: HugeCTR V2.2.1
What’s New in Version 2.2.1
In HugeCTR version 2.2.1, we enriched the user-convenience features, alongside refactoring efforts and bug fixes.
- Dataset in Parquet Format Support: The HugeCTR data reader was extended to support the Parquet format. The preprocessed dataset and its metadata can be generated with NVTabular.
- GPU-Powered Preprocessing Script: The preprocessing script used for HugeCTR samples such as DCN, DeepFM, and W&D was rewritten in NVTabular. This GPU-accelerated script no longer uses excessive host memory.
- Preprocessing Tool for DLRM Sample: To make it easier for the user to run our DLRM sample, a preprocessing tool written in CUDA C++ was added.
- Use of RAPIDS MLPrims: Some existing layers were rewritten to utilize the highly optimized machine learning primitives (MLPrims) of RAPIDS.
- Reorganization of Submodules: All the submodules were moved to the `third_party` directory.
- Revived Pascal Support: Support for the Pascal architecture, e.g., P100, was added back. However, with a Pascal graphics card, `InteractionLayer` doesn't support FP16.
- Compile Time Reduction: By modularizing the embedding-related code into several files, the HugeCTR compile time was improved.
- Refactoring of `Tensor` and `GeneralBuffer`: In HugeCTR, `Tensor` and `GeneralBuffer` are used for memory management and access control. In version 2.2.1, they were refactored to clarify their responsibilities and to support different memory kinds, e.g., Host, Device, and Unified. Check their interface changes if you are using them to add a new layer.
Merlin: HugeCTR V2.2
New Features in Version 2.2
HugeCTR version 2.2 adds many features to enhance its usability and performance. HugeCTR is not only a high-performance reference design for framework designers but also a self-contained training framework.
- Algorithm Search: HugeCTR runs an exhaustive algorithm search for each fully connected layer to find the best-performing one.
- AUC: A user can choose to use AUC as an evaluation metric in addition to AverageLoss. It is also possible to stop training when AUC reaches a specified threshold.
- Batch Shuffle in Training Dataset: Training data batch shuffling is supported.
- Different Batch Sizes for Training and Evaluation: A user can specify different batch sizes for training and evaluation. This can be useful for tuning overall performance.
- Full FP16 Pipeline: In order to improve data and compute throughput, we added a full FP16 pipeline.
- Fused Fully Connected Layer: In FP16 mode, you can choose to use a specialized fully connected layer fused with the ReLU activation function.
- Evaluation Data Caching on Device: For GPUs with large memory capacity, such as the A100, a user can choose to cache data batches for small evaluation datasets.
- Interaction Layer: We added the Interaction layer used in popular models such as DLRM.
- Optimized Data Reader for Raw Data Format: The Raw data format is supported to simplify one-hot data reading and achieve better performance.
- Deep Learning Recommendation Model (DLRM): We enabled and optimized the training of DLRM. Please find more details in samples/dlrm.
- Learning Rate Scheduling: Different learning rate schedules are supported.
- Weight Initialization Methods: For each trainable layer, a user can choose which method, e.g., XavierUniform, Zero, etc., is used for its weight initialization.
- Ampere Support: We tested and optimized HugeCTR for the Ampere architecture.
HugeCTR v2.2 Beta Release
HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training. In the version 2.2 beta release, we introduce several important feature updates:
- Algorithm Search: Supports algorithm selection in fully connected layers for better performance.
- AUC: Supports AUC calculation for accuracy evaluation.
- Batch shuffle and last batch in eval: Batch shuffling is supported, and the last batch during evaluation is no longer dropped.
- Different batch sizes in training and evaluation: Supported for best performance in evaluation.
- Full mixed precision training pipeline: Supports full mixed precision training [1].
- Fused fully connected layer: Bias addition and ReLU activation are fused into a single layer.
- Caching evaluation data on device: For GPUs with large memory, such as the A100, data can be cached for small evaluation datasets.
- Interaction layer
- Optimized data reader for raw format
- Learning rate scheduling
[1] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. "Mixed Precision Training." https://arxiv.org/abs/1710.03740
HugeCTR A100 PREVIEW
HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training. The newly released NVIDIA A100 GPU delivers excellent acceleration at various scales for AI, data analytics, and high-performance computing (HPC), and meets extremely demanding computing challenges. To demonstrate HugeCTR's performance on the A100 GPU, this version was developed to leverage the new features of the latest GPU.