Skip to content

Commit

Permalink
Add information about the Merlin DAG
Browse files Browse the repository at this point in the history
- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
- Karl's info about Dataset.
  • Loading branch information
mikemckiernan committed Dec 2, 2022
1 parent 5388a1d commit d9a832e
Show file tree
Hide file tree
Showing 15 changed files with 2,837 additions and 6 deletions.
1,196 changes: 1,196 additions & 0 deletions docs/images/dataset_and_dataframe.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
410 changes: 410 additions & 0 deletions docs/images/graph_schema.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
369 changes: 369 additions & 0 deletions docs/images/graph_simple.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
555 changes: 555 additions & 0 deletions docs/images/parquet_and_dataset.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 1 addition & 2 deletions docs/requirements-doc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -18,5 +18,4 @@ mergedeep<1.4
docker<5.1
PyGithub<1.56
semver>=2,<3
pytest<7.3
coverage<6.6

107 changes: 107 additions & 0 deletions docs/source/about-dag.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
# About the Merlin Graph

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

## Purpose of the Merlin Graph

Merlin uses a directed acyclic graph (DAG) to represent operations on data such as normalizing or clipping values and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference.

Understanding the Merlin DAG is helpful if you want to develop your own Operator or building a recommender system with Merlin.

## Graph Terminology

node
: A node in the DAG is a group of columns and at least one _operator_.
The columns are specified with a _column selector_.
A node has an _input schema_ and an _output schema_.
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.

column selector
: A column selector specifies the columns to select from a dataset using column names or _tags_.

operator
: An operator performs a transformation on data and return a new _node_.
The data is identified by the _column selector_.
Some simple operators like `+` and `-` add or remove columns.
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.

schema
: A Merlin schema is metadata that describes the columns in a dataset.
Each column has its own schema that identifies the column name and can specify _tags_ and properties.

tag
: A Merlin tag categorizes information about a column.
Adding a tag to a column enables you to select columns for operations by tag rather than name.

For example, you can add the `CONTINUOUS` or `CATEGORICAL` tags to columns.
Feature engineering Operators, modeling, and inference operations can use that information to operate accordingly on the dataset.

## Introduction to Operators, Columns, Nodes, and Schema

The NVTabular library uses Operators for feature engineering.
One example of an NVTabular Operator is `Normalize`.
The Operator normalizes continuous variables between `0` and `1`.

The Merlin Systems library uses Operators for building ensembles and performing inference.
The library includes Operators such as `FilterCandidates` and `PredictTensorflow`.
You use these Operators to put your models into production and serve recommendations.

Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
The `>>` syntax means "take the output columns from the left-hand side and feed them as the input columns to the right-hand side."

You can specify an explicit list of columns names for an Operator.
The following code block shows the syntax for explicit column names:

```python
result = ["col1", "col2",] >> SomeOperator(...)
```

Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:

```python
result = AnOperator(...) >> OtherOperator(...)
```

Chaining Operators together builds a graph.
The following figure shows how each node in the graph has an Operator.

![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)

```{tip}
After you build an NVTabular workflow or Merlin Systems transform workflow, you can visualize the graph and create an image like the preceding example by running the `graph` method.
```

Each node in a graph has an input schema and an output schema that describe the input columns to the Operator and the output columns produced by the Operator.
The following figure represents an Operator, `SomeOperator`, that adds `colB` to a dataset.

![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)

In practice, when Merlin first builds the graph, the workflow does not initially know which columns are input or output.
This is for two reasons:

1. Merlin enables you to build graphs that process categories of columns.
The categories are specified by _tags_ instead of an explicit list of column names.

For example, you can select the continuous columns from your dataset with code like the following example:

```python
[Tags.CONTINUOUS] >> Operator(...)
```

1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.

## Reference Documentation

- {py:class}`nvtabular.ops.Normalize`
- {py:class}`nvtabular.workflow.workflow.Workflow`
- {py:class}`merlin.systems.dag.ops.workflow.TransformWorkflow`
- {py:class}`merlin.systems.dag.Ensemble`
- {py:class}`merlin.systems.dag.ops.session_filter.FilterCandidates`
- {py:class}`merlin.systems.dag.tensorflow.PredictTensorFlow`
58 changes: 58 additions & 0 deletions docs/source/about-dataset.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# About the Merlin Dataset

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

## On-disk Representation

The Apache Parquet file format is the most-frequently used file format for Merlin datasets.

Parquet is a columnar storage format.
The format arranges the values for each column in a long list.
This format is in contrast with a row-oriented format---such as a comma-separated values format---that arranges all the data for one row together.

As an analogy, columnar storage is like a dictionary of columns instead of row-oriented storage that is like a list of rows.

In most cases, a Parquet dataset includes multiple files in one or more directories.

![The Merlin dataset class can read a directory of Parquet files for data access.](../images/parquet_and_dataset.svg)

The Merlin dataset class, `merlin.io.Dataset`, treats a collection of many Parquet files as a single dataset.
By treating the collection as a single dataset, Merlin simplifies distributing computation over multiple GPUs or multiple machines.

The dataset class is not a copy of the data or a modification of the Parquet files.
An instance of the class is similar to a collection of pointers to the Parquet files.

When you create an instance of the dataset class, Merlin attempts to infer a schema by reading one record of the data.
Merlin attempts to determine the column names and data types.

## Processing Data: Dataset and DataFrame

When you perform a computation on a Merlin dataset, the dataset reads from the files on disk and converts them into a set of DataFrames.
The DataFrames, like Parquet files, use a columnar storage format.
The API for a DataFrame is similar to a Python dictionary---you can reference a column with syntax like `dataframe['col1']`.

![A Merlin dataset reads data from disk and becomes several DataFrames.](../images/dataset_and_dataframe.svg)

Merlin processes each DataFrame individually and aggregates the results across the DataFrames as needed.
There are two kinds of computations that you can perform on a dataset: `fit` and `transform`.

The `fit` computations perform a full pass over the dataset to compute statistics, find unique values, perform grouping, or another operation that requires information from multiple DataFrames.

The `transform` computations process each DataFrame individually.
These computations use the information gathered from `fit` to alter the DataFrame.
For example the `Normalize` and `Clip` Operators compute new values for columns and the `Rename` Operator adds and removes columns.

More information about the `fit` and `transform` methods is provided in [](./about-operators.md).

## Reference Documentation

- {py:class}`merlin.io.Dataset`
- {py:class}`nvtabular.ops.Normalize`
- {py:class}`nvtabular.ops.Clip`
- {py:class}`nvtabular.ops.Rename`
3 changes: 3 additions & 0 deletions docs/source/about-model-blocks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# About Merlin Model Blocks

FIXME
89 changes: 89 additions & 0 deletions docs/source/about-operators.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# About Merlin Operators

```{contents}
---
depth: 2
local: true
backlinks: none
---
```

## Understanding Operators

Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on.

An Operator implements two key methods:

Fit
: The `fit` method performs any pre-computation steps that are required before modifying the data.

For example, the `Normalize` Operator normalizes the values of a continuous variable between `0` and `1`.
The `fit` method determines the minimum and maximum values.

The method is optional.
For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping.
These Operators do not need to access the data to perform any pre-computation steps.

Transform
: The `transform` method operates on the dataset such as normalizing values, bucketing, or clipping.
This method modifies the data.

Another difference between the two methods is that the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object.
The difference is an implementation detail---the `fit` method must access all the data and the `transform` method processes each part of the dataset one at a time.

```{code-block} python
---
emphasize-lines: 5, 12
---
# Typical signature of a fit method.
def fit(
self,
selector: ColumnSelector,
dataset: Dataset
) -> Any
# Typical signature of a transform method.
def transform(
self,
selector: ColumnSelector,
df: DataFrame
) -> DataFrame
```

## Operators and Columns: Column Selector

In most cases, you want an Operator to process a subset of the columns in your input dataset.
Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on.
Merlin uses a `ColumnSelector` class to represent the columns.

The simplest column selector is a list of strings that specify some column names.
In the following sample code, `["col1", "col2"]` become an instance of a `ColumnSelector` class.

```python
result = ["col1", "col2"] >> SomeOperator(...)
```

Column selectors also offer a more powerful and flexible way to specify columns.
You can specify the input columns to an Operator with tags.
In the following sample code, the Operator processes all the continuous variables in a dataset.

```python
result = [Tags.CONTINUOUS] >> SomeOperator(...)
```

Using tags to create a column selector offers the following advantages:

- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables.
- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset.
- Simplifies code by avoiding lists of strings for column names.

## How to Build an Operator

Blah.

## Reference Documentation

- {py:class}`merlin.dag.BaseOperator`
- {py:class}`merlin.dag.ColumnSelector`
- {py:class}`merlin.schema.Tags`
- {py:class}`merlin.io.DataSet`
3 changes: 3 additions & 0 deletions docs/source/about-schema.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# About the Merlin Schema

FIXME
10 changes: 9 additions & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@
#
html_theme = "sphinx_rtd_theme"
html_theme_options = {
"navigation_depth": 3,
"titles_only": True,
"analytics_id": "G-NVJ1Y1YJHK",
}
html_copy_source = False
Expand Down Expand Up @@ -118,6 +118,14 @@

autosummary_generate = True

intersphinx_mapping = {
"python": ("https://docs.python.org/3", None),
"merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
"merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
"merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
"NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
}

copydirs_additional_dirs = ["../../examples/", "../../README.md"]

copydirs_file_rename = {
Expand Down
4 changes: 4 additions & 0 deletions docs/source/technical-concepts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Merlin Technical Concepts

The following pages provide a deeper technical understanding of Merlin concepts.
These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
8 changes: 8 additions & 0 deletions docs/source/toc.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -46,5 +46,13 @@ subtrees:
title: Deploy the HugeCTR Model with Triton
- file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
title: Deploy the TensorFlow Model with Triton
- title: Merlin Technical Concepts
file: technical-concepts.md
entries:
- file: about-dag.md
- file: about-dataset.md
- file: about-schema.md
- file: about-operators.md
- file: about-model-blocks.md
- file: containers.rst
- file: support_matrix/index.rst
22 changes: 22 additions & 0 deletions requirements/docs.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# docs
ipython==8.2.0
Sphinx==3.5.4
jinja2<3.1
markupsafe==2.0.1
natsort==8.1.0
sphinx_rtd_theme
sphinx_markdown_tables
sphinx-multiversion@git+https://github.com/mikemckiernan/[email protected]
sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/[email protected]
sphinx-external-toc<0.4
myst-nb
linkify-it-py
Markdown==3.3.7

# smx
mergedeep<1.4
docker<5.1
PyGithub<1.56
semver>=2,<3
pytest<7.3
coverage<6.6
6 changes: 3 additions & 3 deletions tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -36,14 +36,14 @@ commands =
; Generates documentation with sphinx. There are other steps in the Github Actions workflow
; to publish the documentation on release.
changedir = {toxinidir}
deps = -rrequirements/docs.txt
deps = -r requirements/docs.txt
commands =
python -m sphinx.cmd.build -P -b html docs/source docs/build/html
python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}

[testenv:docs-multi]
; Run the multi-version build that is shown on GitHub Pages.
changedir = {toxinidir}
deps = -rrequirements/docs.txt
deps = -r requirements/docs.txt
commands =
sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
sphinx-multiversion docs/source docs/build/html
Expand Down

0 comments on commit d9a832e

Please sign in to comment.