Add information about the Merlin DAG

- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
NVIDIA-Merlin · Nov 30, 2022 · ef04422
1 parent c2f9519
Showing 12 changed files with 1,000 additions and 5 deletions.
diff --git a/docs/images/graph_schema.svg b/docs/images/graph_schema.svg
diff --git a/docs/images/graph_simple.svg b/docs/images/graph_simple.svg
diff --git a/docs/requirements-doc.txt b/docs/requirements-doc.txt
@@ -18,5 +18,4 @@ mergedeep<1.4
 docker<5.1
 PyGithub<1.56
 semver>=2,<3
-pytest<7.3
-coverage<6.6
+
diff --git a/docs/source/about-dag.md b/docs/source/about-dag.md
@@ -0,0 +1,84 @@
+# About the Merlin Directed Acyclic Graph
+
+```{contents}
+---
+depth: 2
+local: true
+backlinks: none
+---
+```
+
+Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and to represent operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.
+
+Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.
+
+## Graph Terminology
+
+node
+: A node in the DAG is a group of columns and at least one _operator_.
+  The columns are specified with a _column selector_.
+  A node has an _input schema_ and an _output schema_.
+  Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.
+
+column selector
+: A column selector specifies the columns to select from a dataset using column names or _tags_.
+
+operator
+: An operator performs a transformation on data and returns a new _node_.
+  The data is identified by the _column selector_.
+  Some simple operators like `+` and `-` add or remove columns.
+  More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.
+
+schema
+: A Merlin schema is metadata that describes the columns in a dataset.
+  Each column has its own schema that identifies the column name and can specify _tags_ and properties.
+
+tag
+: A Merlin tag categorizes information about a column.
+  Adding a tag to a column enables you to select columns for operations by tag rather than name.
+
+  For example, you can add the `USER` and `ITEM` tags to columns.
+  Modeling and inference operations can use that information to act accordingly on the dataset.
+
+## Understanding Operators, Columns, Nodes, and Schema
+
+Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
+The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."
+
+To run an Operator on specific columns only, specify an explicit list of column names.
+The following code block shows the syntax for explicit column names:
+
+```python
+result = ["col1", "col2"] >> SomeOperator(...)
+```
+
+Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:
+
+```python
+result = AnOperator(...) >> OtherOperator(...)
+```
+
+Chaining Operators together builds a graph.
+The following figure shows how each node in the graph has an Operator.
+
+![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
+
+Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
+The following figure represents an Operator that adds `colB` to a dataset.
+
+![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
+
+In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced.
+This is for two reasons:
+
+1. Merlin enables you to build graphs that process categories of columns.
+   The categories are specified by _tags_ instead of an explicit list of column names.
+
+   For example, you can select the continuous columns from your dataset with code like the following example:
+
+   ```python
+   [Tags.CONTINUOUS] >> Operator(...)
+   ```
+
+1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
+   The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
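
Review note on `about-dag.md`: the `>>` chaining and the delayed resolution of tag-based selectors described above can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not Merlin's implementation. The `Node`, `Operator`, `AddColumn`, and `Tags` classes and the dictionary-based schema are hypothetical stand-ins invented for illustration. The sketch shows that building the graph records only selectors and operators, and that column names are determined later, once a schema is available.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set, Union


class Tags(Enum):
    """Hypothetical tags; a stand-in for merlin.schema.Tags."""
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


# Toy schema (assumption): column name -> set of tags attached to that column.
Schema = Dict[str, Set[Tags]]
Selector = List[Union[str, Tags]]


class Operator:
    """Toy operator base class (not Merlin's BaseOperator)."""

    def output_columns(self, input_columns: List[str]) -> List[str]:
        return list(input_columns)

    def __rrshift__(self, selector: Selector) -> "Node":
        # A list does not define `>>`, so `[...] >> op` falls back to this method.
        return Node(selector=list(selector)) >> self


class AddColumn(Operator):
    """Toy operator that appends one new column to its input columns."""

    def __init__(self, name: str) -> None:
        self.name = name

    def output_columns(self, input_columns: List[str]) -> List[str]:
        return list(input_columns) + [self.name]


@dataclass
class Node:
    """Toy graph node: an optional operator, a selector, and a parent link."""
    op: Optional[Operator] = None
    selector: Selector = field(default_factory=list)
    parent: Optional["Node"] = None

    def __rshift__(self, op: Operator) -> "Node":
        # `node >> op` feeds this node's output into a new node that holds `op`.
        return Node(op=op, parent=self)

    def input_columns(self, schema: Schema) -> List[str]:
        # Columns are resolved only here, when a schema is finally known.
        if self.parent is not None:
            return self.parent.output_columns(schema)
        return [
            name for name, tags in schema.items()
            if name in self.selector or any(tag in tags for tag in self.selector)
        ]

    def output_columns(self, schema: Schema) -> List[str]:
        columns = self.input_columns(schema)
        return self.op.output_columns(columns) if self.op else columns


# Build graphs first; no dataset or schema is needed at this point.
by_name = ["colA"] >> AddColumn("colB") >> AddColumn("colC")
by_tag = [Tags.CONTINUOUS] >> AddColumn("scaled")

# Later, a schema becomes available and the column names can be resolved.
schema = {
    "price": {Tags.CONTINUOUS},
    "city": {Tags.CATEGORICAL},
    "age": {Tags.CONTINUOUS},
}
resolved_by_tag = by_tag.output_columns(schema)
resolved_by_name = by_name.output_columns({"colA": set()})
```

In this sketch, `by_tag` stores only `Tags.CONTINUOUS` until `output_columns` is called with a schema, which mirrors the reason given above for delaying schema resolution.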
diff --git a/docs/source/about-model-blocks.md b/docs/source/about-model-blocks.md
@@ -0,0 +1,3 @@
+# About Merlin Model Blocks
+
+FIXME
diff --git a/docs/source/about-operators.md b/docs/source/about-operators.md
@@ -0,0 +1,85 @@
+# About Merlin Operators
+
+```{contents}
+---
+depth: 2
+local: true
+backlinks: none
+---
+```
+
+## Understanding Operators
+
+Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on.
+
+An Operator implements two key methods:
+
+Fit
+: The `fit` method performs any pre-computation steps that are required before operating on data.
+
+  For example, the `NormalizeMinMax` Operator scales the values of a continuous column between 0 and 1.
+  The `fit` method determines the minimum and maximum values.
+
+  The method is optional.
+  For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping.
+  These Operators do not need to access the data to perform any pre-computation steps.
+
+Transform
+: The `transform` method applies the operation to the data, such as normalizing values, bucketing, or clipping.
+
+The two methods also differ in their inputs: the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object.
+The difference is an implementation detail: the `fit` method must access all the data, whereas the `transform` method processes the dataset one part at a time.
+
+```python
+# Typical signature of a fit method.
+def fit(
+    self,
+    selector: ColumnSelector,
+    dataset: Dataset
+) -> Any: ...
+
+# Typical signature of a transform method.
+def transform(
+    self,
+    selector: ColumnSelector,
+    df: DataFrame
+) -> DataFrame: ...
+```
+
+## Operators and Columns: Column Selector
+
+In most cases, you want an Operator to process a subset of the columns in your input dataset.
+Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on.
+Merlin uses a `ColumnSelector` class to represent the columns.
+
+The simplest column selector is a list of strings that specify some column names.
+In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class.
+
+```python
+result = ["col1", "col2"] >> SomeOperator(...)
+```
+
+Column selectors also offer a more powerful and flexible way to specify columns.
+You can specify the input columns to an Operator with tags.
+In the following sample code, the Operator processes all the continuous variables in a dataset.
+
+```python
+result = [Tags.CONTINUOUS] >> SomeOperator(...)
+```
+
+Using tags to create a column selector offers the following advantages:
+
+- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables.
+- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset.
+- Simplifies code by avoiding lists of strings for column names.
+
+## How to Build an Operator
+
+Blah.
+
+## Reference Documentation
+
+- {py:class}`merlin.dag.BaseOperator`
+- {py:class}`merlin.dag.ColumnSelector`
+- {py:class}`merlin.schema.Tags`
+- {py:class}`merlin.io.Dataset`
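
Review note on `about-operators.md`: the division of labor between `fit` and `transform` described above might be easier to follow with a runnable sketch. The following is a minimal illustration in plain Python, not Merlin code. `ToyMinMaxScaler`, `ToyClip`, and the `Partition` type are hypothetical names, a list of dictionaries stands in for a dataset, and each dictionary stands in for one DataFrame partition. The sketch assumes min-max scaling, where `fit` must see all the data and `transform` handles one partition at a time.

```python
from typing import Dict, List, Tuple

# Assumption: one partition is a mapping of column name -> values,
# standing in for a DataFrame. A dataset is a list of partitions.
Partition = Dict[str, List[float]]


class ToyMinMaxScaler:
    """Toy operator that needs statistics from the whole dataset."""

    def __init__(self) -> None:
        self.stats: Dict[str, Tuple[float, float]] = {}

    def fit(self, columns: List[str], dataset: List[Partition]) -> None:
        # fit reads every partition to find the global minimum and maximum.
        for col in columns:
            values = [v for part in dataset for v in part[col]]
            self.stats[col] = (min(values), max(values))

    def transform(self, columns: List[str], part: Partition) -> Partition:
        # transform sees a single partition and reuses the fitted statistics.
        out = dict(part)
        for col in columns:
            low, high = self.stats[col]
            span = (high - low) or 1.0  # avoid dividing by zero for constant columns
            out[col] = [(v - low) / span for v in part[col]]
        return out


class ToyClip:
    """Toy operator with no fit step: the bounds come from the caller."""

    def __init__(self, min_value: float, max_value: float) -> None:
        self.min_value = min_value
        self.max_value = max_value

    def transform(self, columns: List[str], part: Partition) -> Partition:
        out = dict(part)
        for col in columns:
            out[col] = [min(max(v, self.min_value), self.max_value) for v in part[col]]
        return out


dataset = [{"x": [0.0, 5.0]}, {"x": [10.0]}]

scaler = ToyMinMaxScaler()
scaler.fit(["x"], dataset)
scaled = [scaler.transform(["x"], part) for part in dataset]

clipped = [ToyClip(1.0, 6.0).transform(["x"], part) for part in dataset]
```

`ToyClip` has no `fit` method because its bounds are supplied by the caller, which corresponds to the statement above that `Bucketize` and `Clip` do not need a pre-computation step.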
diff --git a/docs/source/about-schema.md b/docs/source/about-schema.md
@@ -0,0 +1,3 @@
+# About the Merlin Schema
+
+FIXME
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -118,6 +118,14 @@
 
 autosummary_generate = True
 
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
+    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
+    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
+    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
+}
+
 copydirs_additional_dirs = ["../../examples/", "../../README.md"]
 
 copydirs_file_rename = {

diff --git a/docs/source/technical-concepts.md b/docs/source/technical-concepts.md
@@ -0,0 +1,4 @@
+# Merlin Technical Concepts
+
+The following pages provide a deeper technical understanding of Merlin concepts.
+These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
diff --git a/docs/source/toc.yaml b/docs/source/toc.yaml
@@ -46,5 +46,13 @@ subtrees:
                     title: Deploy the HugeCTR Model with Triton
                   - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
                     title: Deploy the TensorFlow Model with Triton
+      - title: Merlin Technical Concepts
+        file: technical-concepts.md
+        entries:
+          - file: about-dag.md
+            title: Graph Concepts
+          - file: about-schema.md
+          - file: about-operators.md
+          - file: about-model-blocks.md
       - file: containers.rst
       - file: support_matrix/index.rst
diff --git a/requirements/docs.txt b/requirements/docs.txt
@@ -0,0 +1,22 @@
+# docs
+ipython==8.2.0
+Sphinx==3.5.4
+jinja2<3.1
+markupsafe==2.0.1
+natsort==8.1.0
+sphinx_rtd_theme
+sphinx_markdown_tables
+sphinx-multiversion@git+https://github.com/mikemckiernan/[email protected]
+sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/[email protected]
+sphinx-external-toc<0.4
+myst-nb
+linkify-it-py
+Markdown==3.3.7
+
+# smx
+mergedeep<1.4
+docker<5.1
+PyGithub<1.56
+semver>=2,<3
+pytest<7.3
+coverage<6.6
diff --git a/tox.ini b/tox.ini
@@ -36,14 +36,14 @@ commands =
 ; Generates documentation with sphinx. There are other steps in the Github Actions workflow
 ; to publish the documentation on release.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
-    python -m sphinx.cmd.build -P -b html docs/source docs/build/html
+    python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}
 
 [testenv:docs-multi]
 ; Run the multi-version build that is shown on GitHub Pages.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
     sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
     sphinx-multiversion docs/source docs/build/html