Merlin: HugeCTR 23.04
What's New in Version 23.04
Hierarchical Parameter Server Enhancements:
- HPS Table Fusion: Starting with this release, you can fuse tables of the same embedding vector size in HPS. This feature is supported in the HPS plugin for TensorFlow and the Triton backend for HPS. To turn on table fusion, set `fuse_embedding_table` to `true` in the HPS JSON file, as sketched below. This feature requires that the key values in different tables do not overlap and that the embedding lookup layers are not dependent on each other in the model graph. For more information, refer to the HPS configuration documentation and the HPS table fusion demo notebook. Table fusion can reduce embedding lookup latency significantly when there are multiple tables and the GPU embedding cache is employed. For the fused case demonstrated in the notebook, about a 3x speedup is achieved on V100 compared to the unfused case.
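A minimal sketch of an HPS JSON configuration with table fusion enabled, written from Python for illustration. Only `fuse_embedding_table` comes from the notes above; the surrounding keys and values are placeholders modeled on typical HPS configurations and may differ from your deployment.

```python
# Hypothetical HPS configuration sketch: only "fuse_embedding_table" is the knob
# described above; all other keys and values are illustrative placeholders.
import json

hps_config = {
    "supportlonglong": True,
    "fuse_embedding_table": True,  # fuse tables that share the same embedding vector size
    "models": [
        {
            "model": "demo_model",                             # placeholder model name
            "sparse_files": ["table0.model", "table1.model"],  # placeholder embedding table files
            "embedding_vecsize_per_table": [16, 16],           # fusion requires equal vector sizes
            "deployed_device_list": [0],
        }
    ],
}

with open("hps.json", "w") as f:
    json.dump(hps_config, f, indent=2)
```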
- UVM Support: We have upgraded the static embedding solution. For embedding tables whose size exceeds the device memory, we save the high-frequency embeddings in HBM as an embedding cache and offload the remaining embeddings to UVM. Compared with the dynamic cache solution, which offloads the remaining embeddings to the volatile DB, the UVM solution has higher CPU lookup throughput. Online updating for the UVM solution will be supported in a future release. Users can switch between the embedding cache solutions through the `embedding_cache_type` configuration parameter, as sketched below.
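The parameter name `embedding_cache_type` comes from the notes above; the value string used here ("uvm") is an assumption inferred from the cache solutions being described, so check the HPS configuration documentation for the exact accepted values.

```python
# Hedged sketch: switch the per-model embedding cache solution in the HPS JSON.
# Only the parameter name is from the notes above; the value and other keys are placeholders.
import json

model_entry = {
    "model": "demo_model",             # placeholder
    "sparse_files": ["table0.model"],  # placeholder
    "deployed_device_list": [0],
    "embedding_cache_type": "uvm",     # assumed value: keep hot embeddings in HBM, offload the rest to UVM
}

print(json.dumps(model_entry, indent=2))
```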
- Triton Perf Analyzer's Request Generator: We have added an inference request generator that produces the JSON request format required by the Triton Perf Analyzer. By using this request generator together with the model generator, you can use the Triton Perf Analyzer to profile HPS performance and run stress tests. For API documentation and demo usage, refer to the README.
General Updates:
- DenseLayerComputeConfig: MLP and CrossLayer support asynchronous weight gradient computation alongside data gradient backpropagation during training. We have added a new member, `hugectr.DenseLayerComputeConfig`, to `hugectr.DenseLayer` for configuring the compute behavior. The knob for enabling asynchronous weight gradient computation has been moved from `hugectr.CreateSolver` to `hugectr.DenseLayerComputeConfig.async_wgrad`. The knob for controlling the fusion mode of weight gradients and bias gradients has been moved from `hugectr.DenseLayerSwitchs` to `hugectr.DenseLayerComputeConfig.fuse_wb`. A usage sketch follows this item.
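A hedged sketch of the new knob placement, assuming an already constructed `hugectr.Model` named `model`. The members `async_wgrad` and `fuse_wb` are named above; the `compute_config` keyword used to attach the configuration to `hugectr.DenseLayer`, as well as the surrounding layer arguments, are assumptions for illustration.

```python
# Hedged sketch: configure asynchronous wgrad and wgrad/bgrad fusion per layer.
# async_wgrad and fuse_wb are the members named in the notes; the compute_config
# keyword and the other layer arguments are assumptions.
import hugectr

compute_config = hugectr.DenseLayerComputeConfig(
    async_wgrad=True,  # overlap weight-gradient computation with data-gradient backprop
    fuse_wb=False,     # fusion mode for weight gradients and bias gradients
)

model.add(  # `model` is assumed to be a previously constructed hugectr.Model
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.MLP,
        bottom_names=["interaction1"],        # placeholder tensor names
        top_names=["mlp2"],
        num_outputs=[1024, 512, 256, 1],      # placeholder layer sizes
        compute_config=compute_config,        # assumed keyword for attaching the config
    )
)
```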
- Hopper Architecture Support: Users can build HugeCTR from scratch with compute capability 9.0 (`DSM=90`) so that it can run on Hopper architectures. Note that our NGC container does not support this compute capability yet. Users who are unfamiliar with how to build HugeCTR can refer to the HugeCTR Contribution Guide.
- RoCE Support for Hybrid Embedding: With the parameter `CommunicationType.IB_NVLink_Hier` in [HybridEmbeddingParams](https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#hybridembeddingparam-class), RoCE is supported. We have also added two environment variables, `HUGECTR_ROCE_GID` and `HUGECTR_ROCE_TC`, so that a user can control the RoCE NIC's GID and traffic class, as sketched below.
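A short sketch of the new environment variables together with the communication type. The GID and traffic-class values are placeholders, and the `communication_type` keyword mentioned in the comment is an assumption based on the HybridEmbeddingParams documentation linked above.

```python
# Hedged sketch: select the RoCE-capable hierarchical communication type and
# optionally pin the NIC GID and traffic class. The numeric values are placeholders.
import os
import hugectr

os.environ["HUGECTR_ROCE_GID"] = "3"   # RoCE GID index to use (placeholder value)
os.environ["HUGECTR_ROCE_TC"] = "106"  # RoCE traffic class (placeholder value)

# Pass this as the communication type of hugectr.HybridEmbeddingParam (assumed keyword:
# communication_type); the other constructor arguments are unrelated to this change.
comm_type = hugectr.CommunicationType.IB_NVLink_Hier
```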
Documentation Updates:
- Data Reader: We have enhanced our Raw data reader to read multi-hot input data, connecting with an embedding collection seamlessly. The Raw dataset format is strengthened as well. Refer to our online documentation for more details. We have also refined the description of the Norm dataset.
- Embedding Collection: We have added the knob `is_exclusive_keys` to enable potential acceleration if a user has already preprocessed the input of the embedding collection so that the resulting tables are exclusive with one another. We have also added the knob `comm_strategy` to the embedding collection so that users can configure an optimized communication strategy in multi-node training.
- HPS Plugin: We have fixed the unit of measurement for the DLRM inference benchmark results that leverage the HPS plugin. We have updated the user guides for the HPS plugin for TensorFlow and the HPS plugin for TensorRT.
- Embedding Cache: We have updated the usage information and descriptions for the three types of embedding cache.
Issues Fixed:
- We added a slot-emptiness check to prevent `SparseParam` from being misused.
- We revised the MPI lifetime service into an MPI init service with a slightly greater scope and a clearer interface. In this effort, we also fixed a rare bug that could lead to access violations during the MPI shutdown procedure.
- We fixed a segmentation fault that occurs when a GPU has no embedding wgrad to update.
- SOK build & runtime error related to TF version: We made the SOK Experiment](https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/experiment) compatible with the Tensorflow >= v2.11.0. The legacy SOK doesn’t support that and newer versions of Tensorflow.
- Previously, HPS required CPU memory to be at least 2.5x larger than the model size during initialization. Starting with this release, we parse the model embedding files in chunks, which reduces the required memory to 1.3x the model size.
Known Issues:
- HugeCTR can lead to a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()` and the resource becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, a user can set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource (see the sketch below), if they know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, because this could affect the performance of Parquet reading.
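A minimal sketch of the workaround above. Setting the variable before importing HugeCTR is an assumption about when the variable is read; exporting it in the launching shell works as well.

```python
# Hedged sketch of the workaround: tell HugeCTR not to install its own RMM device
# resource when rmm::mr::set_current_device_resource() is called elsewhere.
import os

os.environ["HCTR_RMM_SETTABLE"] = "0"  # set before HugeCTR objects are created (assumption)

import hugectr  # imported after the environment variable is set
```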
- HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC as well as pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](#243).
- `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
- The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
- Joint loss training with a regularizer is not supported.
- Dumping Adam optimizer states to AWS S3 is not supported.