The purpose of this sample is to demonstrate how to build and train a DLRM model with HugeCTR.
You can set up the HugeCTR Docker environment by doing one of the following:
HugeCTR is available as buildable source code, but the easiest way to install and run HugeCTR is to pull the pre-built Docker image, which is available on the NVIDIA GPU Cloud (NGC). This method provides a self-contained, isolated, and reproducible environment for repetitive experiments.
- Pull the HugeCTR NGC Docker by running the following command:
$ docker pull nvcr.io/nvidia/merlin/merlin-training:0.6
- Launch the container in interactive mode with the HugeCTR root directory mounted into the container by running the following command:
$ docker run --gpus=all --rm -it -u $(id -u):$(id -g) -v $(pwd):/hugectr -w /hugectr nvcr.io/nvidia/merlin/merlin-training:0.6
Please refer to Build HugeCTR Docker Containers to build on your own and set up the Docker container. Make sure that HugeCTR is built and installed to the system path /usr/local/hugectr
within the Docker container. Launch the container in interactive mode in the same manner as above, and then set the PYTHONPATH
environment variable inside the Docker container using the following command:
$ export PYTHONPATH=/usr/local/hugectr/lib:$PYTHONPATH
Ensure that you've met the following requirements:
- MLPerf v0.7: DGX A100 or DGX2 (32GB V100)
- MLPerf v1.0: DGX A100 14 nodes
The Terabyte Click Logs provided by CriteoLabs is used in this sample. The row count of each embedding table is limited to 40 million. The data is processed the same way as dlrm. For more information, see Benchmarking. Each sample has 40 32bits integers in which the first integer is label. The next 13 integers are dense features and the following 26 integers are category features.
-
Download the terabyte datasets from the Terabyte Click Logs into the
"${project_home}/samples/dlrm/"
folder. -
Unzip the datasets and name them in the following manner:
day_0
,day_1
, ...,day_23
. -
Preprocess the datasets using the following command:
# Usage: dlrm_raw input_dir output_dir --train {days for training} --test {days for testing} $ dlrm_raw ./ ./ \ --train 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 \ --test 23
This operation will generate
train.bin(671.2GB)
andtest.bin(14.3GB)
.
Run the Python script using the following command:
$ python3 dlrm_terabyte_fp16_64k.py
Run the single node DGX-100 Python script using the following command:
$ python3 dgx_a100.py
Run the 14-node DGX-100 Python script using the following command:
$ numactl --interleave=all python3 dgx_a100_14x8x640.py
IMPORTANT NOTES:
- To run the 14-node DGX-100 training script on Selene, you need to submit the job on the Selene login node properly.
- In v2.2.1, there is a CUDA Graph error that occurs when running this sample on DGX2. To run it on DGX2, specify
"use_cuda_graph = False
withinCreateSolver
in the Python script. For detailed information about this error, see Known Issues. cache_eval_data
is only supported on DGX A100. If you're running DGX2, disable it.
Ensure that you've met the following requirements:
- DGX A100 or DGX2 (32GB V100)
The Kaggle Display Advertising dataset is provided by CriteoLabs. For more information, see https://ailab.criteo.com/ressources/. The original training set contains 45,840,617 examples. Each sample consists of a label (0 if the ad wasn't clicked and 1 if the ad was clicked) and 39 features (13 integer features and 26 categorical features). The dataset is also missing numerous values across the feature columns, which should be preprocessed accordingly. The original test set doesn't contain labels, so it's not used.
-
Go (here) and download the Kaggle Display Advertising Dataset into the
"${project_home}/samples/dlrm/"
folder. As an alternative, you can run the following commands:# download and preprocess $ cd ./samples/dlrm/ $ tar zxvf dac.tar.gz
The
dlrm_raw
tool converts the original data to HugeCTR's raw format and fills the missing values. Onlytrain.txt
will be used to generate raw data. -
Convert the dataset to the HugeCTR raw data format by running the following command:
# Usage: dlrm_raw input_dir out_dir $ dlrm_raw ./ ./
This operation will generate
train.bin(5.5GB)
andtest.bin(700MB)
.
Train with HugeCTR by running the following command:
$ python3 dlrm_kaggle_fp32.py