Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
mikemckiernan committed Nov 30, 2022
1 parent c2f9519 commit ef04422
Showing 12 changed files with 1,000 additions and 5 deletions.
410 changes: 410 additions & 0 deletions docs/images/graph_schema.svg
369 changes: 369 additions & 0 deletions docs/images/graph_simple.svg
3 changes: 1 addition & 2 deletions docs/requirements-doc.txt
@@ -18,5 +18,4 @@ mergedeep<1.4
docker<5.1
PyGithub<1.56
semver>=2,<3
pytest<7.3
coverage<6.6

84 changes: 84 additions & 0 deletions docs/source/about-dag.md
@@ -0,0 +1,84 @@
# About the Merlin Directed Acyclic Graph

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

Merlin uses a directed acyclic graph (DAG) to represent operations on data such as filtering or bucketing and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.

## Graph Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
The columns are specified with a _column selector_.
A node has an _input schema_ and an _output schema_.
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and returns a new _node_.
The data is identified by the _column selector_.
Some simple operators like `+` and `-` add or remove columns.
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
Adding a tag to a column enables you to select columns for operations by tag rather than name.

For example, you can add the `USER` and `ITEM` tags to columns.
Modeling and inference operations can use that information to act accordingly on the dataset.

## Understanding Operators, Columns, Nodes, and Schema

Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."

To run an Operator on specific columns only, specify an explicit list of column names.
The following code block shows the syntax for explicit column names:

```python
result = ["col1", "col2"] >> SomeOperator(...)
```

Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:

```python
result = AnOperator(...) >> OtherOperator(...)
```

Chaining Operators together builds a graph.
The following figure shows how each node in the graph has an Operator.

![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
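
The way `>>` turns a chain of operators into a graph can be sketched with Python's operator overloading. The following is a toy illustration in plain Python, not Merlin's implementation: `ToyNode`, `ToyOperator`, and `describe` are hypothetical names invented for this example, and the two operator subclasses only borrow the fictional names used above.

```python
class ToyNode:
    """Hypothetical graph node: a label plus links to its parent nodes."""

    def __init__(self, label, parents=()):
        self.label = label
        self.parents = list(parents)

    def __rshift__(self, operator):
        # node >> operator: feed this node's output into a new node.
        return ToyNode(operator.label, parents=[self])


class ToyOperator:
    """Hypothetical operator base class that knows how to join a graph."""

    label = "ToyOperator"

    def __rshift__(self, operator):
        # operator >> operator: promote the left side to a node, then chain.
        return ToyNode(self.label) >> operator

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> operator: the list becomes a selection node.
        selection = ToyNode(f"Selection{list(columns)}")
        return selection >> self


class SomeOperator(ToyOperator):
    label = "SomeOperator"


class OtherOperator(ToyOperator):
    label = "OtherOperator"


def describe(node):
    """Walk from the last node back to the start of a linear chain."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.parents[0] if node.parents else None
    return " <- ".join(labels)


graph = ["col1", "col2"] >> SomeOperator() >> OtherOperator()
print(describe(graph))
# OtherOperator <- SomeOperator <- Selection['col1', 'col2']
```

In this sketch, a plain list has no `>>` behavior of its own, so Python falls back to the reflected method on the right-hand operand. That fallback is what lets a list of column names start a chain.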

Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
The following figure represents an Operator that adds `colB` to a dataset.

![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
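
To make the relationship between input and output schemas concrete, here is a minimal sketch in plain Python. It assumes a schema can be modeled as a list of column names, which is a simplification, and the `AddColumn`, `DropColumn`, and `propagate` names are hypothetical helpers written for this example rather than Merlin classes.

```python
class AddColumn:
    """Hypothetical operator that keeps its input columns and adds one more."""

    def __init__(self, name):
        self.name = name

    def compute_output_schema(self, input_schema):
        return input_schema + [self.name]


class DropColumn:
    """Hypothetical operator that removes one column from its input."""

    def __init__(self, name):
        self.name = name

    def compute_output_schema(self, input_schema):
        return [col for col in input_schema if col != self.name]


def propagate(input_schema, operators):
    """Return the (input, output) schema pair for each node in a linear chain."""
    pairs = []
    for op in operators:
        output_schema = op.compute_output_schema(input_schema)
        pairs.append((input_schema, output_schema))
        # The output schema of one node is the input schema of the next.
        input_schema = output_schema
    return pairs


schemas = propagate(["colA"], [AddColumn("colB"), DropColumn("colA")])
for node_input, node_output in schemas:
    print(node_input, "->", node_output)
```

The first pair corresponds to the figure: the node receives `colA` and produces `colA` and `colB`.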

In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced.
This is for two reasons:

1. Merlin enables you to build graphs that process categories of columns.
The categories are specified by _tags_ instead of an explicit list of column names.

For example, you can select the continuous columns from your dataset with code like the following example:

```python
[Tags.CONTINUOUS] >> Operator(...)
```

1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
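
The two reasons above amount to late binding: a selector records names and tags when the graph is built, and the concrete column names are worked out only once a dataset is available. The following toy sketch shows the idea in plain Python. The `ToySelector` class, its `resolve` method, and the dictionary-based schemas are assumptions made for illustration and do not reflect how Merlin stores this information.

```python
class ToySelector:
    """Hypothetical selector that stores names and tags but resolves nothing yet."""

    def __init__(self, names=(), tags=()):
        self.names = list(names)
        self.tags = set(tags)

    def resolve(self, dataset_schema):
        """Turn names and tags into concrete column names for one dataset.

        dataset_schema maps each column name to the set of tags on that column.
        """
        selected = []
        for column, column_tags in dataset_schema.items():
            if column in self.names or self.tags & column_tags:
                selected.append(column)
        return selected


# Built before any dataset is available.
continuous = ToySelector(tags={"CONTINUOUS"})

# Later, two datasets resolve the same selector to different columns.
schema_v1 = {"user_id": {"CATEGORICAL"}, "price": {"CONTINUOUS"}}
schema_v2 = {**schema_v1, "rating": {"CONTINUOUS"}}

columns_v1 = continuous.resolve(schema_v1)
columns_v2 = continuous.resolve(schema_v2)
```

Here `columns_v2` picks up the new `rating` column without any change to the selector.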
3 changes: 3 additions & 0 deletions docs/source/about-model-blocks.md
@@ -0,0 +1,3 @@
# About Merlin Model Blocks

FIXME
85 changes: 85 additions & 0 deletions docs/source/about-operators.md
@@ -0,0 +1,85 @@
# About Merlin Operators

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

## Understanding Operators

Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on.

An Operator implements two key methods:

Fit
: The `fit` method performs any pre-computation steps that are required before operating on data.

For example, the `NormalizeMinMax` Operator scales the values of a continuous column between 0 and 1.
The `fit` method determines the minimum and maximum values.

The method is optional.
For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping.
These Operators do not need to access the data to perform any pre-computation steps.

Transform
: The `transform` method applies the operation to the data, such as normalizing values, bucketing, or clipping.

Another difference between the two methods is that the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object.
The difference is an implementation detail: the `fit` method must access all the data, whereas the `transform` method processes the dataset one part at a time.

```python
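from typing import Any

from merlin.dag import ColumnSelector
from merlin.io import Dataset
from pandas import DataFrame  # Assumption: a cuDF DataFrame is used on GPU.


# Typical signature of a fit method.
def fit(
    self,
    selector: ColumnSelector,
    dataset: Dataset,
) -> Any:
    ...  # Pre-compute statistics from the whole dataset.


# Typical signature of a transform method.
def transform(
    self,
    selector: ColumnSelector,
    df: DataFrame,
) -> DataFrame:
    ...  # Process one part of the dataset.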
```
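
The split between the two methods can be illustrated with a toy example in plain Python. This is a sketch under simplifying assumptions, not Merlin code: a dataset is modeled as a list of partitions, each partition is a dictionary of column values, and `ToyMinMax` and `ToyClip` are hypothetical classes that take a plain list of column names instead of a `ColumnSelector`.

```python
class ToyMinMax:
    """Hypothetical operator that needs a fit step to learn min and max."""

    def fit(self, columns, dataset):
        # The fit step looks at every partition to compute global statistics.
        self.stats = {}
        for col in columns:
            values = [v for part in dataset for v in part[col]]
            self.stats[col] = (min(values), max(values))

    def transform(self, columns, partition):
        # The transform step sees one partition at a time.
        out = dict(partition)
        for col in columns:
            low, high = self.stats[col]
            span = (high - low) or 1  # Avoid dividing by zero for constant columns.
            out[col] = [(v - low) / span for v in partition[col]]
        return out


class ToyClip:
    """Hypothetical operator with no fit step: the caller supplies the bounds."""

    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def transform(self, columns, partition):
        out = dict(partition)
        for col in columns:
            out[col] = [
                min(max(v, self.min_value), self.max_value) for v in partition[col]
            ]
        return out


dataset = [{"price": [10, 20]}, {"price": [30, 50]}]

scaler = ToyMinMax()
scaler.fit(["price"], dataset)  # One pass over all partitions.
scaled = [scaler.transform(["price"], part) for part in dataset]

clipped = ToyClip(15, 40).transform(["price"], dataset[1])  # No fit required.
```

The scaler cannot transform the first partition correctly until it has seen the maximum value in the second partition, which is why its `fit` step takes the whole dataset. The clipping operator has nothing to learn from the data, so it skips `fit` entirely.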

## Operators and Columns: Column Selector

In most cases, you want an Operator to process a subset of the columns in your input dataset.
Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on.
Merlin uses a `ColumnSelector` class to represent the columns.

The simplest column selector is a list of strings that specify some column names.
In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class.

```python
result = ["col1", "col2"] >> SomeOperator(...)
```

Column selectors also offer a more powerful and flexible way to specify columns.
You can specify the input columns to an Operator with tags.
In the following sample code, the Operator processes all the continuous variables in a dataset.

```python
result = [Tags.CONTINUOUS] >> SomeOperator(...)
```

Using tags to create a column selector offers the following advantages:

- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables.
- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset.
- Simplifies code by avoiding lists of strings for column names.

## How to Build an Operator

Blah.

## Reference Documentation

- {py:class}`merlin.dag.BaseOperator`
- {py:class}`merlin.dag.ColumnSelector`
- {py:class}`merlin.schema.Tags`
- {py:class}`merlin.io.Dataset`
3 changes: 3 additions & 0 deletions docs/source/about-schema.md
@@ -0,0 +1,3 @@
# About the Merlin Schema

FIXME
8 changes: 8 additions & 0 deletions docs/source/conf.py
@@ -118,6 +118,14 @@

autosummary_generate = True

intersphinx_mapping = {
"python": ("https://docs.python.org/3", None),
"merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
"merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
"merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
"NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
}

copydirs_additional_dirs = ["../../examples/", "../../README.md"]

copydirs_file_rename = {
4 changes: 4 additions & 0 deletions docs/source/technical-concepts.md
@@ -0,0 +1,4 @@
# Merlin Technical Concepts

The following pages provide a deeper technical understanding of Merlin concepts.
These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
8 changes: 8 additions & 0 deletions docs/source/toc.yaml
@@ -46,5 +46,13 @@ subtrees:
title: Deploy the HugeCTR Model with Triton
- file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
title: Deploy the TensorFlow Model with Triton
- title: Merlin Technical Concepts
file: technical-concepts.md
entries:
- file: about-dag.md
title: Graph Concepts
- file: about-schema.md
- file: about-operators.md
- file: about-model-blocks.md
- file: containers.rst
- file: support_matrix/index.rst
22 changes: 22 additions & 0 deletions requirements/docs.txt
@@ -0,0 +1,22 @@
# docs
ipython==8.2.0
Sphinx==3.5.4
jinja2<3.1
markupsafe==2.0.1
natsort==8.1.0
sphinx_rtd_theme
sphinx_markdown_tables
sphinx-multiversion@git+https://github.com/mikemckiernan/[email protected]
sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/[email protected]
sphinx-external-toc<0.4
myst-nb
linkify-it-py
Markdown==3.3.7

# smx
mergedeep<1.4
docker<5.1
PyGithub<1.56
semver>=2,<3
pytest<7.3
coverage<6.6
6 changes: 3 additions & 3 deletions tox.ini
@@ -36,14 +36,14 @@ commands =
; Generates documentation with sphinx. There are other steps in the Github Actions workflow
; to publish the documentation on release.
changedir = {toxinidir}
deps = -rrequirements/docs.txt
deps = -r requirements/docs.txt
commands =
python -m sphinx.cmd.build -P -b html docs/source docs/build/html
python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}

[testenv:docs-multi]
; Run the multi-version build that is shown on GitHub Pages.
changedir = {toxinidir}
deps = -rrequirements/docs.txt
deps = -r requirements/docs.txt
commands =
sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
sphinx-multiversion docs/source docs/build/html