Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
mikemckiernan committed Nov 30, 2022
1 parent c2f9519 commit dbb0b19
Showing 11 changed files with 897 additions and 3 deletions.
410 changes: 410 additions & 0 deletions docs/images/graph_schema.svg
369 changes: 369 additions & 0 deletions docs/images/graph_simple.svg
84 changes: 84 additions & 0 deletions docs/source/about-dag.md
@@ -0,0 +1,84 @@
# About the Merlin Directed Acyclic Graph

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.

## Graph Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
The columns are specified with a _column selector_.
A node has an _input schema_ and an _output schema_.
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and returns a new _node_.
The data is identified by the _column selector_.
Some simple operators like `+` and `-` add or remove columns.
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
Adding a tag to a column enables you to select columns for operations by tag rather than name.

For example, you can add the `USER` and `ITEM` tags to columns.
Modeling and inference operations can use that information to act accordingly on the dataset.
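
The relationship between a schema, its tags, and tag-based selection can be sketched in plain Python. This is a minimal illustration only and does not use the Merlin API: `ToyColumnSchema` and `select_by_tag` are hypothetical names, and the tags are plain strings rather than Merlin tag objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyColumnSchema:
    """Hypothetical per-column metadata: a column name plus a set of tags."""

    name: str
    tags: frozenset = field(default_factory=frozenset)


def select_by_tag(schema, tag):
    """Return the names of the columns that carry the given tag."""
    return [col.name for col in schema if tag in col.tags]


# A toy schema is just a list of column schemas.
schema = [
    ToyColumnSchema("user_id", frozenset({"USER", "CATEGORICAL"})),
    ToyColumnSchema("item_id", frozenset({"ITEM", "CATEGORICAL"})),
    ToyColumnSchema("price", frozenset({"ITEM", "CONTINUOUS"})),
]

print(select_by_tag(schema, "ITEM"))  # ['item_id', 'price']
```

Selecting by tag means the calling code never has to mention `item_id` or `price` by name, so the same selection keeps working if the dataset's column names change.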

## Understanding Operators, Columns, Nodes, and Schema

Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."

To run an Operator on specific columns only, provide an explicit list of column names.
The following code block shows the syntax for explicit column names:

```python
result = ["col1", "col2"] >> SomeOperator(...)
```

Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:

```python
result = AnOperator(...) >> OtherOperator(...)
```

Chaining Operators together builds a graph.
The following figure shows how each node in the graph has an Operator.

![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
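
How `>>` can build a graph is easiest to see in a toy model. The following sketch is not Merlin's implementation: `ToyNode`, `ToyOperator`, `ToySelection`, and `describe` are hypothetical names that only illustrate how Python's `__rshift__` and `__rrshift__` hooks allow both a list of column names and an Operator to appear on the left-hand side.

```python
class ToyNode:
    """Hypothetical graph node: one operator plus links to its parent nodes."""

    def __init__(self, op, parents=()):
        self.op = op
        self.parents = list(parents)

    def __rshift__(self, op):
        # node >> op: the new node takes this node's output as its input.
        return ToyNode(op, parents=[self])


class ToyOperator:
    """Hypothetical operator that only carries a name."""

    def __init__(self, name):
        self.name = name

    def __rshift__(self, op):
        # op_a >> op_b: wrap the left-hand operator in a node first.
        return ToyNode(self) >> op

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> op: a list does not implement `>>`, so Python
        # falls back to the right-hand operand's __rrshift__.
        return ToyNode(ToySelection(columns)) >> self


class ToySelection(ToyOperator):
    """Hypothetical selection operator that records the chosen column names."""

    def __init__(self, columns):
        super().__init__("Selection(" + ", ".join(columns) + ")")


def describe(node):
    """List operator names from the root of the graph down to this node."""
    names = []
    while node is not None:
        names.append(node.op.name)
        node = node.parents[0] if node.parents else None
    return list(reversed(names))


graph = ["col1", "col2"] >> ToyOperator("SomeOperator")
print(describe(graph))  # ['Selection(col1, col2)', 'SomeOperator']
```

In this toy model, every use of `>>` adds one node, so the two-node graph in the figure corresponds to a single `>>` between a column list and an operator.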

Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
The following figure represents an Operator that adds `colB` to a dataset.

![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
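
The schema flow in the figure can be sketched as a function from an input schema to an output schema. This is an assumption-laden toy, not the Merlin API: schemas are reduced to lists of column names, and `add_column` and `propagate` are hypothetical helpers.

```python
def add_column(new_column):
    """Build a hypothetical operator that appends one column to its input schema."""

    def compute_output_schema(input_schema):
        # Keep every input column and append the new one at the end.
        return [*input_schema, new_column]

    return compute_output_schema


def propagate(input_schema, operators):
    """Feed each operator's output schema into the next operator in the chain."""
    schemas = [list(input_schema)]
    for compute_output_schema in operators:
        schemas.append(compute_output_schema(schemas[-1]))
    return schemas


# One operator that adds colB, as in the figure.
print(propagate(["colA"], [add_column("colB")]))  # [['colA'], ['colA', 'colB']]
```

Each entry in the result is the schema at one point in the chain: the first is the input schema of the first operator, and every later entry is the output schema of one operator and the input schema of the next.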

In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced.
There are two reasons for this:

1. Merlin enables you to build graphs that process categories of columns.
   The categories are specified by _tags_ instead of an explicit list of column names.

   For example, you can select the continuous columns from your dataset with code like the following example:

   ```python
   from merlin.schema import Tags

   [Tags.CONTINUOUS] >> Operator(...)
   ```

1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
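
The two reasons above amount to deferred resolution: the graph stores a tag, and the column names are filled in only when a dataset is available. The following sketch shows the idea with a hypothetical `ToyWorkflow` class. It is not NVTabular's `Workflow`, and the dataset schema is assumed to be a plain dictionary that maps column names to sets of tag strings.

```python
class ToyWorkflow:
    """Hypothetical graph that selects columns by tag and resolves them lazily."""

    def __init__(self, tag):
        self.tag = tag
        self.input_columns = None  # unknown until a dataset schema is available

    def fit(self, dataset_schema):
        # dataset_schema maps each column name to its set of tags.
        self.input_columns = [
            name for name, tags in dataset_schema.items() if self.tag in tags
        ]
        return self


# Build the graph first. No dataset is involved yet.
workflow = ToyWorkflow("CONTINUOUS")
print(workflow.input_columns)  # None

# The column names become known only when the dataset is accessed.
dataset_schema = {
    "user_id": {"USER", "CATEGORICAL"},
    "price": {"CONTINUOUS"},
    "age": {"USER", "CONTINUOUS"},
}
workflow.fit(dataset_schema)
print(workflow.input_columns)  # ['price', 'age']
```

Because resolution is deferred, the same toy workflow can be fit against a different dataset and resolve to a different list of columns without any change to the graph definition.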
3 changes: 3 additions & 0 deletions docs/source/about-model-blocks.md
@@ -0,0 +1,3 @@
# About Merlin Model Blocks

FIXME
5 changes: 5 additions & 0 deletions docs/source/about-operators.md
@@ -0,0 +1,5 @@
# About Merlin Operators

## How to Build an Operator

FIXME
3 changes: 3 additions & 0 deletions docs/source/about-schema.md
@@ -0,0 +1,3 @@
# About the Merlin Schema

FIXME
8 changes: 8 additions & 0 deletions docs/source/conf.py
@@ -118,6 +118,14 @@

autosummary_generate = True

intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
}

copydirs_additional_dirs = ["../../examples/", "../../README.md"]

copydirs_file_rename = {
4 changes: 4 additions & 0 deletions docs/source/technical-concepts.md
@@ -0,0 +1,4 @@
# Merlin Technical Concepts

The following pages provide a deeper technical understanding of Merlin concepts.
These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
8 changes: 8 additions & 0 deletions docs/source/toc.yaml
@@ -46,5 +46,13 @@ subtrees:
title: Deploy the HugeCTR Model with Triton
- file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
title: Deploy the TensorFlow Model with Triton
- title: Merlin Technical Concepts
file: technical-concepts.md
entries:
- file: about-dag.md
title: Graph Concepts
- file: about-schema.md
- file: about-operators.md
- file: about-model-blocks.md
- file: containers.rst
- file: support_matrix/index.rst
File renamed without changes.
6 changes: 3 additions & 3 deletions tox.ini
@@ -36,14 +36,14 @@ commands =
; Generates documentation with sphinx. There are other steps in the Github Actions workflow
; to publish the documentation on release.
changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
commands =
-python -m sphinx.cmd.build -P -b html docs/source docs/build/html
+python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}

[testenv:docs-multi]
; Run the multi-version build that is shown on GitHub Pages.
changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
commands =
sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
sphinx-multiversion docs/source docs/build/html