From 69483128f74c17207d23c8d60701ae161af7a5e8 Mon Sep 17 00:00:00 2001 From: Gabriel Moreira Date: Tue, 18 Apr 2023 19:22:10 -0300 Subject: [PATCH] Improving ranking script docs --- .../quick_start/scripts/ranking/README.md | 81 ++++++++++++++----- 1 file changed, 61 insertions(+), 20 deletions(-) diff --git a/examples/quick_start/scripts/ranking/README.md b/examples/quick_start/scripts/ranking/README.md index e033d4cd4e..7d4688714b 100644 --- a/examples/quick_start/scripts/ranking/README.md +++ b/examples/quick_start/scripts/ranking/README.md @@ -14,23 +14,25 @@ By using MTL, it is typically possible to improve the tasks accuracy for somewha You can find more details in this [post](https://medium.com/nvidia-merlin/building-ranking-models-powered-by-multi-task-learning-with-merlin-and-tensorflow-4b4f993f7cc3) on the multi-task learning building blocks provided by [models](https://github.com/NVIDIA-Merlin/models/) library. -The `ranking.py` script makes it easy to use multi-task learning backed by models library. It is automatically enabled when you provide more than one target column to `--target` arguments. +The `ranking.py` script makes it easy to use multi-task learning backed by the models library. It is automatically enabled when you provide more than one target column to the `--tasks` argument. ## Supported models The `ranking.py` script makes it easy to use baseline and advanced deep ranking models available in [models](https://github.com/NVIDIA-Merlin/models/) library. -The script can be also used as an **advanced example** that demonstrate [how to set specific hyperparamters using models API](ranking_models.py). +The script can also be used as an **advanced example** that demonstrates [how to set specific hyperparameters using the models API](ranking_models.py). ### Baseline ranking architectures -- MLP (`MLPBlock`) - Simple multi-layer perceptron architecture. -- Wide and Deep ([paper](https://dl.acm.org/doi/10.1145/2988450.2988454)) - ... 
-- DeepFM ([paper](https://arxiv.org/abs/1703.04247)) (`DeepFMModel`) - ... -- DRLM ([paper](https://arxiv.org/abs/1906.00091)) (`DLRMModel`) - ... -- DCN-v2 ([paper](https://dl.acm.org/doi/10.1145/3442381.3450078)) (`DCNModel`) - ... +- **MLP** (`--model=mlp`) - Simple multi-layer perceptron architecture. More info in [MLPBlock](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.MLPBlock.html#merlin.models.tf.MLPBlock). +- **Wide and Deep** (`--model=wide_n_deep`) - Aims to leverage the ability of neural networks to generalize and the capacity of linear models to memorize relevant feature interactions. The deep part is an MLP model, with categorical features represented as embeddings, which are concatenated with continuous features and fed through multiple MLP layers. The wide part is a linear model that takes a sparse representation of categorical features (i.e. one-hot or multi-hot representation). More info in [WideAndDeepModel](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.WideAndDeepModel.html#merlin.models.tf.WideAndDeepModel) and its [paper](https://dl.acm.org/doi/10.1145/2988450.2988454). + + +- **DeepFM** (`--model=deepfm`) - The DeepFM architecture is a combination of a Factorization Machine and a Deep Neural Network. More info in [DeepFMModel](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.DeepFMModel.html#merlin.models.tf.DeepFMModel) and its [paper](https://arxiv.org/abs/1703.04247). +- **DLRM** (`--model=dlrm`) - Continuous features are concatenated and combined by the bottom MLP to produce an embedding similar to the categorical embeddings. The factorization machines layer performs second-order (pairwise) feature interaction of those embeddings, which need to have the same dimension. Then those outputs are concatenated and processed through the top MLP layer to output the predictions. 
More info in [DLRMModel](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.DLRMModel.html#merlin.models.tf.DLRMModel) and its [paper](https://arxiv.org/abs/1906.00091). +- **DCN-v2** (`--model=dcn`) - The Improved Deep & Cross Network combines a MLP network with cross-network for powerful and bounded feature interaction. More info in [DCNModel](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.DCNModel.html#merlin.models.tf.DCNModel) and its [paper](https://dl.acm.org/doi/10.1145/3442381.3450078). ### Multi-task learning architectures -- MMOE ([paper](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007)) (`MMOEBlock`) - The Multi-gate Mixture-of-Experts (MMoE) architecture was introduced by Google in 2018 and is one of the most popular models for multi-task learning on tabular data. It allows parameters to be automatically allocated to capture either shared task information or task-specific information. The core components of MMoE are experts and gates. Instead of using a shared-bottom for all tasks, it has multiple expert sub-networks processing input features independently from each other. Each task has an independent gate, which dynamically selects based on the inputs the level with which the task wants to leverage the output of each expert. The gate is typically just a small MLP sub-network that provides softmax scores over the number of experts given the inputs. Those scores are used as weights for computing a weighted average of the experts’ outputs and form an independent representation for each task. -- CGC ([paper](https://dl.acm.org/doi/10.1145/3383313.3412236)) (`CGCBlock`) - Instead of having tasks sharing all the experts like in MMOE, it propos allowing for splitting task-specific experts and shared experts, in an architecture named Customized Gate Control (CGC) Model. 
-- PLE ([paper](https://dl.acm.org/doi/10.1145/3383313.3412236)) (`PLEBlock`) - In the same paper introducing CGC, authors proposed stacking multiple CGC models on top of each other to form a multi-level MTL model, so that the model can progressively combine shared and task-specific experts. They name this approach as Progressive Layered Extraction (PLE). Their paper experiments showed accuracy improvements by using PLE compared to CGC. +- **MMOE** (`--model=mmoe`) - The Multi-gate Mixture-of-Experts (MMoE) is one of the most popular models for multi-task learning on tabular data. It allows parameters to be automatically allocated to capture either shared task information or task-specific information. The core components of MMoE are experts and gates. Instead of using a shared-bottom for all tasks, it has multiple expert sub-networks processing input features independently from each other. Each task has an independent gate, which dynamically selects based on the inputs the level with which the task wants to leverage the output of each expert. More info on [MMOEBlock](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.MMOEBlock.html#merlin.models.tf.MMOEBlock) and its [paper](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007). +- **CGC** (`--model=cgc`) - Instead of having tasks sharing all the experts like in MMOE, it allows for splitting task-specific experts and shared experts, in an architecture named Customized Gate Control (CGC) Model. More info on [CGCBlock](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.CGCBlock.html#merlin.models.tf.CGCBlock) and its [paper](https://dl.acm.org/doi/10.1145/3383313.3412236). +- **PLE** (`--model=ple`) - In the same paper introducing CGC, authors proposed stacking multiple CGC models on top of each other to form a multi-level MTL model, so that the model can progressively combine shared and task-specific experts. They name this approach as Progressive Layered Extraction (PLE). 
The experiments in their paper showed accuracy improvements from using PLE compared to CGC. More info on [PLEBlock](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.CGCBlock.html#merlin.models.tf.CGCBlock) and its [paper](https://dl.acm.org/doi/10.1145/3383313.3412236).
Multi-task learning architectures @@ -39,20 +41,57 @@ The script can be also used as an **advanced example** that demonstrate [how to ## Best practices +### Modeling input features +Neural networks operate on top of dense / continuous float inputs. Continuous features fit nicely into that format, but categorical features need to be represented accordingly. +It is assumed that in the preprocessing the categorical features were encoded as contiguous ids. Then, they are typically represented by the model using: +- **One-hot encoding** - Sparse representation where each categorical value is represented by a binary feature with 1 only for the actual value. If the categorical feature contains a list of values, it can be encoded with multi-hot encoding, with 1s for all values in the list. This encoding is useful to represent low-cardinality categorical features or to provide input to linear models. +- **Embedding** - This encoding is very popular for deep neural networks. Each categorical value is mapped to a 1D continuous vector, which can be trainable or pre-trained. The embeddings are stored in embedding layers or tables, whose first dim is the cardinality of the categorical feature and whose second dim is the embedding size. + +**Dealing with high-cardinality categorical features** +We explain in the [Quick-start preprocessing documentation](../preproc/README.md) that large services might have categorical features with very high cardinality (e.g. on the order of hundreds of millions or higher), like user id or item id. They typically require a lot of memory to be stored (e.g. with embedding tables) or processed (e.g. with one-hot encoding). In addition, most of the categorical values are very infrequent, so it is not possible to learn good embeddings for them. + +The [preprocessing documentation](../preproc/README.md) describes some options to deal with the high-cardinality features: **Frequency capping**, **Filtering out rows with infrequent values** and **Hashing**. 
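The one-hot, multi-hot, and embedding representations described above can be illustrated with a small sketch. It uses plain Python lists rather than the models library, and all helper names (`one_hot`, `multi_hot`, `make_embedding_table`, `embedding_lookup`) are hypothetical, chosen only for this illustration. It assumes the categorical values were already encoded as contiguous ids in preprocessing.

```python
import random


def one_hot(category_id, cardinality):
    """Binary vector with a single 1 at the position of `category_id`."""
    vec = [0] * cardinality
    vec[category_id] = 1
    return vec


def multi_hot(category_ids, cardinality):
    """Binary vector with 1s at the positions of all ids in the list."""
    vec = [0] * cardinality
    for cid in category_ids:
        vec[cid] = 1
    return vec


def make_embedding_table(cardinality, embedding_dim, seed=0):
    """Table of shape (cardinality, embedding_dim) with small random values.

    In a real model these values would be trainable or pre-trained.
    """
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
        for _ in range(cardinality)
    ]


def embedding_lookup(table, category_id):
    """Return the dense vector (one table row) for a contiguous id."""
    return table[category_id]


# A feature with 4 distinct values, encoded as ids 0..3
sparse_single = one_hot(2, cardinality=4)       # one value per example
sparse_list = multi_hot([0, 3], cardinality=4)  # list-valued feature
table = make_embedding_table(cardinality=4, embedding_dim=3)
dense = embedding_lookup(table, 2)              # vector of length 3
```

Note how the one-hot vector grows with the cardinality of the feature, while the looked-up embedding keeps a fixed size, which is one reason embeddings are preferred for higher-cardinality features.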
+ +You might also decide to keep the original high cardinality of the categorical features for a better personalization level and accuracy. + +> The embedding tables are typically responsible for most of the parameters of Recommender System models. For large-scale systems, where the number of users and items is in the order of hundreds of millions, a distributed embeddings solution is typically needed, so that embedding tables can be sharded across multiple compute devices (e.g. GPU, CPU). NVIDIA offers distributed embedding solutions that can be used with Merlin and TensorFlow. + +**TODO**: Add references to NVIDIA distributed embeddings solutions + +**Defining the embedding size** + +A common rule of thumb is that the higher the cardinality of a categorical feature, the higher the embedding dimension should be, as its vector space gets more complex. + +The models library uses by default a heuristic method that sets embedding sizes based on cardinality (implementation [here](https://github.com/NVIDIA-Merlin/models/blob/a5e392cbc575fe984c96ddcbce696e4b71b7073d/merlin/models/utils/schema_utils.py#L169)), which can be scaled by `--embedding_sizes_multiplier`. The models library API also allows setting [specific embedding dims](https://nvidia-merlin.github.io/models/main/generated/merlin.models.tf.Embeddings.html#merlin.models.tf.Embeddings) for each or all categorical features. -Neural networks typically use embeddings (1D continuous vectors) to represent categorical features as input. The embeddings are stored in embedding layers or tables, whose first dim in the cardinality of the categorical feature and 2nd dim is the embedding size. In order to minimize the memory requirements of the embedding table, **the categorical values need to be encoded into contiguous ids in the preprocessing**, which will define the size of the embedding table in the model. 
+> Some models supported by this script (DLRM and DeepFM) require the embedding sizes of categorical features to be the same (`--embedding_dim`) because of their feature interaction approach based on Factorization Machines. -- stl_positive_class_weight -- Negative sampling: in_batch_negatives_train +### Regularization +Neural networks typically require regularization in order to avoid overfitting, in particular when trained on small datasets or for many epochs, which can make them memorize the training set. This script provides typical regularization techniques like Dropout (`--dropout`) and L2 regularization of model parameters (`--l2_reg`) and embeddings (`--embeddings_l2_reg`). -Different ranking model characteristics -- STL: "mlp", "dcn", "dlrm", "deepfm", "wide_n_deep" -- MTL: "mmoe", "cgc","ple", +### Class weights +Typically, positive user interactions are just a small fraction of the items that were exposed to the users. That leads to class-imbalanced targets. +A common technique to deal with this problem in machine learning is to assign a higher weight in the loss to the examples with infrequent targets - positive classes in this case. +You can set the positive class weight for single-task learning models with `--stl_positive_class_weight`. For multi-task learning, you can set the class weight for each target separately by using `--mtl_pos_class_weight_*`, where `*` must be replaced by the target name. In this case, the negative class weight is always 1.0. -### Setting tasks sample space -In that dataset, some targets depend on others. For example, you only have a `like/follow/share=1` event if the user has clicked in the item. The learning of the dependent tasks is better if we set the appropriate sample space for the targets. In this case, we want to train the `click` target using the entire space, and train the other targets (i.e., compute the loss) only for click space (where `click=1`). 
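To make the effect of a positive class weight concrete, here is a minimal sketch of a weighted binary cross-entropy in plain Python. It assumes the weight simply multiplies the per-example loss of positive examples, that negatives keep weight 1.0, and that the result is averaged over all examples; the function name `weighted_bce` is hypothetical and the loss actually used by the script may be implemented differently.

```python
import math


def weighted_bce(y_true, y_pred, pos_class_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy where positive examples get a higher weight.

    y_true: list of 0/1 labels
    y_pred: list of predicted probabilities in [0, 1]
    pos_class_weight: weight applied to examples with label 1
                      (negative examples always use weight 1.0)
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip probabilities to avoid log(0)
        p = min(max(p, eps), 1.0 - eps)
        weight = pos_class_weight if y == 1 else 1.0
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# One positive and three negatives, all predicted with probability 0.5
labels = [1, 0, 0, 0]
preds = [0.5, 0.5, 0.5, 0.5]
unweighted = weighted_bce(labels, preds, pos_class_weight=1.0)
weighted = weighted_bce(labels, preds, pos_class_weight=3.0)
```

With a weight of 3.0, the single positive example contributes as much to the loss as the three negatives combined, which is the kind of rebalancing the class weight arguments are meant to provide.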
+### Loss weights -The scripts allows for setting the tasks sample space by using `--tasks_sample_space`, where the position should match the order of the `--tasks`. Empty value means the task will be trained in the entire space, i.e., loss computed for all examples in the dataset. +You can balance the learning of the multiple tasks by setting loss weights when using multi-task learning. You can set them by providing `--mtl_loss_weight_*` for each task, where `*` must be replaced by the target name. + +### Negative sampling +TODO: Describe `--in_batch_negatives_train` and `--in_batch_negatives_eval` for positive-only data + + +### Multi-task learning: Setting tasks sample space +Some targets might depend on other target columns for some datasets. For example, the preprocessed TenRec dataset has positive (=1) `like`, `follow`, and `share` events only if the user has also clicked on the item (`click=1`). + +You might want to model dependent tasks explicitly by setting the sample space, i.e., computing the loss of the dependent tasks only for examples where the dominant target is 1. That would make the dependent targets less sparse, as their value is always 0 when the dominant target is 0. + +The script allows setting the tasks sample space by using `--tasks_sample_space`, which accepts comma-separated values. The order of sample spaces should match the order of the `--tasks`. An empty value means the task will be trained in the entire space, i.e., loss computed for all examples in the dataset. +For the TenRec dataset, you could use `--tasks click,like,share,follow` and `--tasks_sample_space=,click,click,click`, meaning that the click task will be trained using the entire space and the other tasks will be trained only in click space. + +>

We have observed empirically that if you want a model to predict all tasks at the same time (e.g. the likelihood of a user to click-like-share a post), it is better to train all tasks using the entire space. On the other hand, if you want to train an MTL model that predicts rarer events (e.g. add-to-cart, purchase) given prior events (e.g. click), then you typically get better accuracy on the dependent tasks by training them in the dominant task space, while training the dominant task on the entire space. +
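The positional pairing between `--tasks` and `--tasks_sample_space` can be pictured with a short sketch. The helpers below (`parse_sample_spaces`, `sample_space_mask`, `masked_mean`) are hypothetical and written in plain Python only for illustration; they are not the script's implementation, and they assume each example is a dict of binary target columns.

```python
def parse_sample_spaces(tasks, tasks_sample_space):
    """Pair each task with its sample space by position.

    An empty entry means the task is trained on the entire space (None).
    """
    task_names = tasks.split(",")
    spaces = tasks_sample_space.split(",")
    if len(task_names) != len(spaces):
        raise ValueError("tasks and sample spaces must have the same length")
    return {task: (space or None) for task, space in zip(task_names, spaces)}


def sample_space_mask(rows, space):
    """True for the examples that belong to the task's sample space."""
    if space is None:
        return [True] * len(rows)
    return [row[space] == 1 for row in rows]


def masked_mean(losses, mask):
    """Average the per-example losses over the selected examples only."""
    selected = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(selected) / len(selected) if selected else 0.0


spaces = parse_sample_spaces("click,like,share,follow", ",click,click,click")

rows = [
    {"click": 1, "like": 1},
    {"click": 0, "like": 0},
    {"click": 1, "like": 0},
]
like_mask = sample_space_mask(rows, spaces["like"])    # only clicked rows
click_mask = sample_space_mask(rows, spaces["click"])  # entire space
like_loss = masked_mean([1.0, 5.0, 3.0], like_mask)
```

In this sketch the `like` loss ignores the second example, where `click=0`, while the `click` loss is computed over all three examples.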

## Command line arguments @@ -307,4 +346,6 @@ The scripts allows for setting the tasks sample space by using `--tasks_sample_s Format of the output prediction files. By default 'parquet', which is the most performant format. -``` \ No newline at end of file +``` + +Lamp icon created by Freepik - Flaticon \ No newline at end of file