Merlin: HugeCTR V3.0.1

@zehuanw released this 12 Apr 02:26

What's New in Version 3.0.1

  • DLRM Inference Benchmark: We've added two detailed Jupyter notebooks that illustrate how to train, deploy, and benchmark a DLRM model with HugeCTR. The inference notebook demonstrates how to create the Triton and HugeCTR backend configurations, prepare the inference data, and deploy a model trained in the companion notebook on the Triton Inference Server. It also shows how to benchmark the model's throughput and latency with the Triton Performance Analyzer. For more details, check out our HugeCTR inference repository.
  • FP16-Specific Optimization in More Dense Layers: We've optimized the DotProduct, ELU, and Sigmoid layers with __half2 vectorized loads and stores so that they make better use of device memory bandwidth. Most layers are now optimized in this way; the remaining exceptions are the MultiCross, FmOrder2, ReduceSum, and Multiply layers.
  • More Finely Tunable Synthetic Data Generator: Our new data generator can produce uniformly distributed datasets in addition to power-law based ones. Instead of specifying a total vocabulary_size and a single max_nnz, you can now specify this information per categorical feature. See our user guide for the updated usage.
  • Decreased Memory Demands of Trained Model Exportation: To prevent out-of-memory errors when saving a trained model that includes a very large embedding table, we have substantially reduced the amount of memory allocated by the related functions.
  • CUDA Graph Compatible Dropout Layer: The HugeCTR Dropout layer now uses cuDNN by default, so it can be used together with CUDA Graph. Previously, CUDA Graph was implicitly disabled whenever Dropout was used.
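
The __half2 optimization mentioned above works by packing two FP16 values into one 32-bit word, so each memory transaction moves twice the data. The following CUDA sketch shows the idea for an elementwise ELU forward pass; it is an assumption-laden illustration, not HugeCTR's actual layer code, and the kernel and parameter names are hypothetical.

```cuda
#include <cuda_fp16.h>

// Sketch of __half2 vectorization: each thread loads one __half2
// (two FP16 values) in a single 32-bit transaction, applies ELU,
// and stores the pair back the same way.
__global__ void elu_fp16_kernel(const __half2* in, __half2* out,
                                float alpha, int n_half2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_half2) {
    float2 v = __half22float2(in[i]);  // unpack the pair to float for expf
    v.x = (v.x > 0.f) ? v.x : alpha * (expf(v.x) - 1.f);
    v.y = (v.y > 0.f) ? v.y : alpha * (expf(v.y) - 1.f);
    out[i] = __float22half2_rn(v);     // repack and store as one 32-bit word
  }
}
```

Compared with a scalar __half kernel, this halves the number of load/store instructions and keeps accesses naturally aligned, which is why it better saturates device memory bandwidth on bandwidth-bound elementwise layers.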
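
To give a feel for the per-feature power-law vs. uniform generation described above, here is a minimal NumPy sketch. This is purely illustrative and is not the HugeCTR data generator API; the function name, parameters, and the power-law exponent `alpha` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical helper (not the HugeCTR API): draw synthetic categorical IDs
# either uniformly or from a power-law distribution, with vocabulary size
# and nnz configured per feature rather than once for the whole dataset.
def gen_feature(vocab_size, nnz, num_samples, dist="powerlaw", alpha=1.2, seed=0):
    rng = np.random.default_rng(seed)
    if dist == "uniform":
        # every category ID is equally likely
        return rng.integers(0, vocab_size, size=(num_samples, nnz))
    # power law: P(k) proportional to (k+1)^(-alpha) over the vocabulary
    ranks = np.arange(vocab_size)
    probs = (ranks + 1.0) ** -alpha
    probs /= probs.sum()
    return rng.choice(vocab_size, size=(num_samples, nnz), p=probs)

# Two categorical features with different vocabulary sizes and nnz values
feat_a = gen_feature(vocab_size=1000, nnz=1, num_samples=8)
feat_b = gen_feature(vocab_size=50, nnz=4, num_samples=8, dist="uniform")
```

Configuring each feature separately like this lets a synthetic dataset mimic real workloads, where popular items dominate some features while others are nearly uniform.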