Add information about the Merlin DAG
Define the important terms of the DAG.
mikemckiernan committed Nov 18, 2022
1 parent 980e297 commit 5470be3
Showing 3 changed files with 79 additions and 0 deletions.
69 changes: 69 additions & 0 deletions docs/source/about-dag.md
@@ -0,0 +1,69 @@
# About the Merlin Directed Acyclic Graph

Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.

## DAG Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
The columns are specified with a _column selector_.
A node has an _input schema_ and an _output schema_.
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and returns a new _node_.
The data is identified by the _column selector_.
Some simple operators like `+` and `-` add or remove columns.
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
Adding a tag to a column enables you to select columns for operations by tag rather than name.

For example, you can add the `USER` and `ITEM` tags to columns.
Modeling and inference operations can use that information to act accordingly on the dataset.
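
Selecting columns by tag rather than by name can be pictured with a small, library-free sketch. The `Tag` enum, the `schema` dictionary, and the `select_by_tag` function below are hypothetical stand-ins written for illustration; they are not the Merlin schema API, and only the tag names `USER` and `ITEM` come from the text above.

```python
from enum import Enum


class Tag(Enum):
    # Hypothetical stand-in for Merlin tags.
    USER = "user"
    ITEM = "item"


# Toy schema: column name -> set of tags attached to that column.
schema = {
    "user_id": {Tag.USER},
    "user_age": {Tag.USER},
    "item_id": {Tag.ITEM},
    "item_brand": {Tag.ITEM},
    "timestamp": set(),
}


def select_by_tag(schema, tag):
    """Return the names of all columns that carry the given tag, in schema order."""
    return [name for name, tags in schema.items() if tag in tags]


item_columns = select_by_tag(schema, Tag.ITEM)
user_columns = select_by_tag(schema, Tag.USER)
```

In this toy model, an operation that needs item columns asks for the tag and never has to know the column names.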

## Syntax and Sample Code

The following code block shows the typical syntax for building a workflow that operates on DAG components.

```{rubric} Syntax
```

```python
result = [column_selector, ...] >> op1 >> op2 >> ...
```

Starting with the `column_selector`, the brackets group one or more column selectors that identify columns in the input data.

`op1` and `op2` represent operators.
When an operator performs its operation on the input data, the operator returns a node.

The `result` object is the graph.
It contains the sequence of operations to perform.
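
To see how a plain Python list on the left of `>>` can end up as a graph, the following minimal sketch overloads the shift operator. The `Node`, `Operator`, `Op1`, and `Op2` classes are hypothetical and exist only for this illustration; the sketch assumes nothing about how Merlin implements its graph internally.

```python
class Node:
    """Toy graph node: a column selection followed by an ordered chain of operators."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = list(ops)

    def __rshift__(self, op):
        # node >> op: return a new node with the operator appended to the chain.
        return Node(self.columns, self.ops + [op])


class Operator:
    """Toy operator base class (hypothetical, for illustration only)."""

    def __rrshift__(self, columns):
        # ["a", "b"] >> op: lists do not define >>, so Python falls back to the
        # operator's reflected method, which wraps the list in a node first.
        return Node(columns) >> self


class Op1(Operator):
    pass


class Op2(Operator):
    pass


result = ["col_a", "col_b"] >> Op1() >> Op2()
op_names = [type(op).__name__ for op in result.ops]
```

In this sketch, `result` only records which columns were selected and which operators to run, in order. Nothing touches any data at this point.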

```{rubric} Sample Code
```

```python
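from nvtabular.ops import Categorify, TagAsItemFeatures

# Select three item columns, encode them as integers, then tag them as item features.
item_features = (
    ["item_category", "item_shop", "item_brand"] >> Categorify(dtype="int32") >> TagAsItemFeatures()
)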
```

In the sample code, the column selector is created by specifying the item-related column names.

The {py:class}`~nvtabular.ops.Categorify` operator transforms the categorical features into unique integer values, adds the {py:attr}`~merlin.schema.Tags.CATEGORICAL` tag, and returns a node.

The {py:class}`~nvtabular.ops.TagAsItemFeatures` operator applies the {py:attr}`~merlin.schema.Tags.ITEM` tag and returns a node.

When `item_features` is included in a workflow and applied to input data, the nodes are traversed in order, and each node applies its data transformation or tagging.
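
The in-order traversal can be sketched without any Merlin dependency. Everything below is a hypothetical toy: `categorify`, `tag_as_item_features`, and `run` are plain functions invented for this illustration, the data is a dictionary of lists, and the integer ids start at 0, which is not meant to mirror the encoding that the real `Categorify` operator produces.

```python
def categorify(data, tags, columns):
    """Toy encoder: map each distinct value in a column to an integer id."""
    for col in columns:
        ids = {}
        data[col] = [ids.setdefault(value, len(ids)) for value in data[col]]
        tags.setdefault(col, set()).add("categorical")


def tag_as_item_features(data, tags, columns):
    """Toy tagger: leave the data alone and attach an 'item' tag."""
    for col in columns:
        tags.setdefault(col, set()).add("item")


def run(columns, steps, data):
    """Apply each step, in order, to a copy of the selected columns."""
    out = {col: list(data[col]) for col in columns}
    tags = {}
    for step in steps:
        step(out, tags, columns)
    return out, tags


data = {
    "item_brand": ["acme", "zenith", "acme"],
    "price": [9.5, 20.0, 9.5],
}
encoded, tags = run(["item_brand"], [categorify, tag_as_item_features], data)
```

Only the selected column is transformed, the steps run in the order they were chained, and the input `data` is left unchanged.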
8 changes: 8 additions & 0 deletions docs/source/conf.py
@@ -118,6 +118,14 @@

autosummary_generate = True

intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
}

copydirs_additional_dirs = ["../../examples/", "../../README.md"]

copydirs_file_rename = {
2 changes: 2 additions & 0 deletions docs/source/toc.yaml
@@ -46,5 +46,7 @@ subtrees:
title: Deploy the HugeCTR Model with Triton
- file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
title: Deploy the TensorFlow Model with Triton
- file: about-dag.md
title: Merlin DAG
- file: containers.rst
- file: support_matrix/index.rst
