This repository contains the Slipstream framework implementation for accelerating recommendation model training.
Publicly available datasets (Criteo Kaggle, Criteo Terabyte, Avazu, Taobao, etc.) can be downloaded and require pre-processing before training.
Follow the steps below to download the raw datasets and pre-process the required dataset for training.
cd Slipstream/DLRM
The code supports interface with the Criteo Kaggle Display Advertising Challenge Dataset.
- Please do the following to prepare the dataset for use with DLRM code:
- First, specify the raw data file (train.txt) as downloaded with --raw-data-file=<./input/kaggle/train.txt>
- This is then pre-processed (categorize, concat across days, ...) for use with the DLRM code
- The processed data is stored as a *.npz file in ./input/kaggle/*.npz
- The processed file (*.npz) can be used for subsequent runs with --processed-data-file=<./input/kaggle/*.npz>
- Criteo Kaggle can be pre-processed using the following script:
./bench/dlrm_s_criteo_kaggle.sh
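After preprocessing, it can be useful to sanity-check the resulting archive. The sketch below is a hypothetical helper (not part of the repo) that lists the arrays stored in a .npz file; the array names in your dump depend on the preprocessing script, so inspect the returned keys rather than assuming them.

```python
import numpy as np

def npz_summary(path):
    """Return a mapping of array name -> shape for a .npz archive."""
    with np.load(path, allow_pickle=True) as data:
        return {name: data[name].shape for name in data.files}
```

Call it on the processed file under ./input/kaggle/ to confirm the dump contains the expected arrays before launching training.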
The code supports interface with the Criteo Terabyte Dataset.
- Please do the following to prepare the dataset for use with DLRM code:
- First, download the raw data files day_0.gz, ..., day_23.gz and unzip them
- Specify the location of the unzipped text files day_0, ..., day_23 using --raw-data-file=<./input/terabyte/day> (the day number is appended automatically)
- These are then pre-processed (categorize, concat across days, ...) for use with the DLRM code
- The processed data is stored as a *.npz file in ./input/terabyte/*.npz
- The processed file (*.npz) can be used for subsequent runs with --processed-data-file=<./input/terabyte/*.npz>
- Criteo Terabyte can be pre-processed using the following script:
./bench/dlrm_s_criteo_terabyte.sh
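The per-day expansion mentioned above (the day number being appended to the --raw-data-file prefix) can be sketched as follows; this is an illustration of the naming convention, not the repo's actual loader code.

```python
def day_files(prefix, num_days=24):
    """Expand a raw-data-file prefix into the per-day file names
    day_0 ... day_{num_days-1}, matching the Criteo Terabyte layout."""
    return [f"{prefix}_{d}" for d in range(num_days)]
```

For example, `day_files("./input/terabyte/day")` yields ./input/terabyte/day_0 through ./input/terabyte/day_23.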
The DLRM baseline can be run on a hybrid CPU-GPU system using the following script:
./run_dlrm_baseline.sh
Slipstream requires the input training dataset to be segregated into hot and cold inputs, and the embeddings into hot and cold embeddings, based on the GPU memory available for hot embedding entries.
Input segregation can be executed on a CPU system using the following script:
./run_input_segregation.sh
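The idea behind the segregation step can be sketched as below. This is an illustrative simplification, not Slipstream's actual implementation: frequently accessed embedding rows are marked "hot", and an input is "hot" only if every categorical index it touches is hot (so it can train entirely on the GPU). The `hot_fraction` parameter and the data layout are assumptions for illustration.

```python
from collections import Counter

def segregate(samples, hot_fraction=0.01):
    """samples: iterable of lists of categorical indices per sample.
    Returns (hot_row_ids, hot_inputs, cold_inputs)."""
    # Count how often each embedding row is accessed across the dataset.
    counts = Counter(idx for sample in samples for idx in sample)
    # Keep the most popular rows as "hot", sized by hot_fraction.
    n_hot = max(1, int(len(counts) * hot_fraction))
    hot = {idx for idx, _ in counts.most_common(n_hot)}
    hot_inputs, cold_inputs = [], []
    for sample in samples:
        if all(i in hot for i in sample):
            hot_inputs.append(sample)
        else:
            cold_inputs.append(sample)
    return hot, hot_inputs, cold_inputs
```

In practice `hot_fraction` would be derived from the available GPU memory rather than fixed up front.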
The FAE baseline can be run on a hybrid CPU-GPU system using the following script:
./run_fae_baseline.sh
Slipstream identifies stale embeddings via a threshold.
Slipstream can be run on a hybrid CPU-GPU system using the following script:
./run_slipstream.sh
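A minimal sketch of threshold-based staleness detection, assuming one plausible criterion (an embedding row whose values barely moved between two training snapshots is stale and its updates can be skipped); the L2 metric and `threshold` value are assumptions, not Slipstream's exact code.

```python
import numpy as np

def stale_rows(prev, curr, threshold=1e-3):
    """Return indices of embedding rows whose L2 change between two
    snapshots is below the staleness threshold."""
    delta = np.linalg.norm(curr - prev, axis=1)
    return np.flatnonzero(delta < threshold)
```

Rows flagged by such a filter would be excluded from gradient updates, which is the source of Slipstream's training speedup.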
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.