Add information about the Merlin DAG

- Define the important terms of the DAG.
- Incorporate Karl's information.
NVIDIA-Merlin · Nov 30, 2022 · dbb0b19
1 parent c2f9519
commit dbb0b19
Showing 11 changed files with 897 additions and 3 deletions.
diff --git a/docs/images/graph_schema.svg b/docs/images/graph_schema.svg
diff --git a/docs/images/graph_simple.svg b/docs/images/graph_simple.svg
diff --git a/docs/source/about-dag.md b/docs/source/about-dag.md
@@ -0,0 +1,84 @@
+# About the Merlin Directed Acyclic Graph
+
+```{contents}
+---
+depth: 2
+local: true
+backlinks: none
+---
+```
+
+Merlin uses a directed acyclic graph (DAG) to represent operations on data, such as filtering or bucketing, and to represent operations in a recommender system, such as creating an ensemble or filtering candidate items during inference.
+
+Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.
+
+## Graph Terminology
+
+node
+: A node in the DAG is a group of columns and at least one _operator_.
+  The columns are specified with a _column selector_.
+  A node has an _input schema_ and an _output schema_.
+  Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.
+
+column selector
+: A column selector specifies the columns to select from a dataset using column names or _tags_.
+
+operator
+: An operator performs a transformation on data and returns a new _node_.
+  The data is identified by the _column selector_.
+  Some simple operators like `+` and `-` add or remove columns.
+  More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.
+
+schema
+: A Merlin schema is metadata that describes the columns in a dataset.
+  Each column has its own schema that identifies the column name and can specify _tags_ and properties.
+
+tag
+: A Merlin tag categorizes information about a column.
+  Adding a tag to a column enables you to select columns for operations by tag rather than name.
+
+  For example, you can add the `USER` and `ITEM` tags to columns.
+  Modeling and inference operations can use that information to act accordingly on the dataset.
+
+## Understanding Operators, Columns, Nodes, and Schema
+
+Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
+The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."
+
+To run an Operator on specific columns only, specify an explicit list of column names.
+The following code block shows the syntax for explicit column names:
+
+```python
+result = ["col1", "col2",] >> SomeOperator(...)
+```
+
+Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:
+
+```python
+result = AnOperator(...) >> OtherOperator(...)
+```
+
+Chaining Operators together builds a graph.
+The following figure shows how each node in the graph has an Operator.
+
+![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
+
+Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
+The following figure represents an Operator that adds `colB` to a dataset.
+
+![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
+
+In practice, when Merlin builds the graph, the workflow does not immediately know which columns are processed or produced.
+This is for two reasons:
+
+1. Merlin enables you to build graphs that process categories of columns.
+   The categories are specified by _tags_ instead of an explicit list of column names.
+
+   For example, you can select the continuous columns from your dataset with code like the following example:
+
+   ```python
+   [Tags.CONTINUOUS] >> Operator(...)
+   ```
+
+1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
+   The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
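
Editor's note, not part of the commit: the deferred column resolution described in `about-dag.md` can be illustrated with a minimal, self-contained sketch. It does not use the Merlin API. `Node`, `Operator`, `AddSuffix`, `AddColumn`, `compute_output`, `output_columns`, and the dictionary-based schema format are all hypothetical stand-ins, and tags are plain strings here rather than Merlin `Tags`. The sketch assumes only that `>>` is implemented with Python's `__rshift__` and `__rrshift__` hooks, which is one plausible way to support the `list >> Operator` syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node: a column selection (no parent) or an operator applied to a parent."""

    op: object = None
    selector: list = field(default_factory=list)  # column names or tag strings
    parent: "Node | None" = None

    def __rshift__(self, op):
        # node >> operator: add a node that consumes this node's output.
        return Node(op=op, parent=self)

    def output_columns(self, dataset_schema):
        # dataset_schema maps column name -> set of tags (a made-up format).
        if self.parent is None:
            wanted = set(self.selector)
            return [
                name
                for name, tags in dataset_schema.items()
                if name in wanted or tags & wanted
            ]
        return self.op.compute_output(self.parent.output_columns(dataset_schema))


class Operator:
    """Hypothetical operator base class; subclasses define compute_output."""

    def __rrshift__(self, selector):
        # list >> operator: wrap the list in a selection node, then chain.
        return Node(selector=list(selector)) >> self


class AddSuffix(Operator):
    def __init__(self, suffix):
        self.suffix = suffix

    def compute_output(self, columns):
        return [f"{name}{self.suffix}" for name in columns]


class AddColumn(Operator):
    def __init__(self, name):
        self.name = name

    def compute_output(self, columns):
        return columns + [self.name]


# Build the graph before any dataset exists; no column names are resolved yet.
graph = ["continuous"] >> AddSuffix("_scaled") >> AddColumn("colB")

# Column names are resolved only once a dataset schema is available.
schema = {"age": {"continuous"}, "city": {"categorical"}, "price": {"continuous"}}
resolved = graph.output_columns(schema)
```

In this toy version, the graph stores only the tag `"continuous"` until `output_columns` receives a schema, which mirrors the point that column names selected by tag are unknown until the dataset is accessed.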
diff --git a/docs/source/about-model-blocks.md b/docs/source/about-model-blocks.md
@@ -0,0 +1,3 @@
+# About Merlin Model Blocks
+
+FIXME
diff --git a/docs/source/about-operators.md b/docs/source/about-operators.md
@@ -0,0 +1,5 @@
+# About Merlin Operators
+
+## How to Build an Operator
+
+FIXME
diff --git a/docs/source/about-schema.md b/docs/source/about-schema.md
@@ -0,0 +1,3 @@
+# About the Merlin Schema
+
+FIXME
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -118,6 +118,14 @@
 
 autosummary_generate = True
 
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
+    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
+    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
+    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
+}
+
 copydirs_additional_dirs = ["../../examples/", "../../README.md"]
 
 copydirs_file_rename = {

diff --git a/docs/source/technical-concepts.md b/docs/source/technical-concepts.md
@@ -0,0 +1,4 @@
+# Merlin Technical Concepts
+
+The following pages provide a deeper technical understanding of Merlin concepts.
+These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
diff --git a/docs/source/toc.yaml b/docs/source/toc.yaml
@@ -46,5 +46,13 @@ subtrees:
                     title: Deploy the HugeCTR Model with Triton
                   - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
                     title: Deploy the TensorFlow Model with Triton
+      - title: Merlin Technical Concepts
+        file: technical-concepts.md
+        entries:
+          - file: about-dag.md
+            title: Graph Concepts
+          - file: about-schema.md
+          - file: about-operators.md
+          - file: about-model-blocks.md
       - file: containers.rst
       - file: support_matrix/index.rst
diff --git a/docs/requirements-doc.txt → requirements/docs.txt b/docs/requirements-doc.txt → requirements/docs.txt
diff --git a/tox.ini b/tox.ini
@@ -36,14 +36,14 @@ commands =
 ; Generates documentation with sphinx. There are other steps in the Github Actions workflow
 ; to publish the documentation on release.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
-    python -m sphinx.cmd.build -P -b html docs/source docs/build/html
+    python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}
 
 [testenv:docs-multi]
 ; Run the multi-version build that is shown on GitHub Pages.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
     sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
     sphinx-multiversion docs/source docs/build/html