From ef0442296c4eeeed3dc57a2836eb5e03a5cbf391 Mon Sep 17 00:00:00 2001
From: Mike McKiernan
Date: Mon, 14 Nov 2022 11:24:31 -0500
Subject: [PATCH] Add information about the Merlin DAG

- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
---
 docs/images/graph_schema.svg      | 410 ++++++++++++++++++++++++++++++
 docs/images/graph_simple.svg      | 369 +++++++++++++++++++++++++++
 docs/requirements-doc.txt         |   3 +-
 docs/source/about-dag.md          |  84 ++++++
 docs/source/about-model-blocks.md |   3 +
 docs/source/about-operators.md    |  85 ++++++
 docs/source/about-schema.md       |   3 +
 docs/source/conf.py               |   8 +
 docs/source/technical-concepts.md |   4 +
 docs/source/toc.yaml              |   8 +
 requirements/docs.txt             |  22 ++
 tox.ini                           |   6 +-
 12 files changed, 1000 insertions(+), 5 deletions(-)
 create mode 100644 docs/images/graph_schema.svg
 create mode 100644 docs/images/graph_simple.svg
 create mode 100644 docs/source/about-dag.md
 create mode 100644 docs/source/about-model-blocks.md
 create mode 100644 docs/source/about-operators.md
 create mode 100644 docs/source/about-schema.md
 create mode 100644 docs/source/technical-concepts.md
 create mode 100644 requirements/docs.txt

diff --git a/docs/images/graph_schema.svg b/docs/images/graph_schema.svg
new file mode 100644
index 000000000..c42c5b78a
--- /dev/null
+++ b/docs/images/graph_schema.svg
@@ -0,0 +1,410 @@
[SVG markup omitted: image/svg+xml, text label "Graph"]
diff --git a/docs/images/graph_simple.svg b/docs/images/graph_simple.svg
new file mode 100644
index 000000000..464975f7e
--- /dev/null
+++ b/docs/images/graph_simple.svg
@@ -0,0 +1,369 @@
[SVG markup omitted: image/svg+xml]
[SVG markup omitted, continued: text label "Graph"]
diff --git a/docs/requirements-doc.txt b/docs/requirements-doc.txt
index e5229f453..bf03f3dcb 100644
--- a/docs/requirements-doc.txt
+++ b/docs/requirements-doc.txt
@@ -18,5 +18,4 @@ mergedeep<1.4
 docker<5.1
 PyGithub<1.56
 semver>=2,<3
-pytest<7.3
-coverage<6.6
+
diff --git a/docs/source/about-dag.md b/docs/source/about-dag.md
new file mode 100644
index 000000000..5778ed4a9
--- /dev/null
+++ b/docs/source/about-dag.md
@@ -0,0 +1,84 @@
+# About the Merlin Directed Acyclic Graph
+
+```{contents}
+---
+depth: 2
+local: true
+backlinks: none
+---
+```
+
+Merlin uses a directed acyclic graph (DAG) to represent operations on data such as filtering or bucketing and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference.
+
+Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin.
+
+## Graph Terminology
+
+node
+: A node in the DAG is a group of columns and at least one _operator_.
+  The columns are specified with a _column selector_.
+  A node has an _input schema_ and an _output schema_.
+  Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset.
+
+column selector
+: A column selector specifies the columns to select from a dataset using column names or _tags_.
+
+operator
+: An operator performs a transformation on data and returns a new _node_.
+  The data is identified by the _column selector_.
+  Some simple operators like `+` and `-` add or remove columns.
+  More complex operations are applied by shifting the operators onto the column selector with the `>>` notation.
+
+schema
+: A Merlin schema is metadata that describes the columns in a dataset.
+  Each column has its own schema that identifies the column name and can specify _tags_ and properties.
+
+tag
+: A Merlin tag categorizes information about a column.
+  Adding a tag to a column enables you to select columns for operations by tag rather than name.
+
+  For example, you can add the `USER` and `ITEM` tags to columns.
+  Modeling and inference operations can use that information to act accordingly on the dataset.
+
+## Understanding Operators, Columns, Nodes, and Schema
+
+Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows.
+The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side."
+
+You can specify an explicit list of columns to run an Operator on just the specified columns.
+The following code block shows the syntax for explicit column names:
+
+```python
+result = ["col1", "col2"] >> SomeOperator(...)
+```
+
+Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator:
+
+```python
+result = AnOperator(...) >> OtherOperator(...)
+```
+
+Chaining Operators together builds a graph.
+The following figure shows how each node in the graph has an Operator.
+
+![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg)
+
+Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator.
+The following figure represents an Operator that adds `colB` to a dataset.
+
+![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg)
+
+In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced.
+This is for two reasons:
+
+1. Merlin enables you to build graphs that process categories of columns.
+   The categories are specified by _tags_ instead of an explicit list of column names.
+
+   For example, you can select the continuous columns from your dataset with code like the following example:
+
+   ```python
+   [Tags.CONTINUOUS] >> Operator(...)
+   ```
+
+1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset.
+   The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names.
\ No newline at end of file
diff --git a/docs/source/about-model-blocks.md b/docs/source/about-model-blocks.md
new file mode 100644
index 000000000..f8850fce7
--- /dev/null
+++ b/docs/source/about-model-blocks.md
@@ -0,0 +1,3 @@
+# About Merlin Model Blocks
+
+FIXME
\ No newline at end of file
diff --git a/docs/source/about-operators.md b/docs/source/about-operators.md
new file mode 100644
index 000000000..9a25e221e
--- /dev/null
+++ b/docs/source/about-operators.md
@@ -0,0 +1,85 @@
+# About Merlin Operators
+
+```{contents}
+---
+depth: 2
+local: true
+backlinks: none
+---
+```
+
+## Understanding Operators
+
+Merlin uses Operators to perform computations on datasets, such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on.
+
+An Operator implements two key methods:
+
+Fit
+: The `fit` method performs any pre-computation steps that are required before operating on data.
+
+  For example, the `NormalizeMinMax` Operator scales the values of a continuous column to the range between 0 and 1.
+  The `fit` method determines the minimum and maximum values.
+
+  The method is optional.
+  For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping.
+  These Operators do not need to access the data to perform any pre-computation steps.
+
+Transform
+: The `transform` method applies the operation to the data, such as normalizing values, bucketing, or clipping.
+
+The two methods also accept different inputs: the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object.
+The difference is an implementation detail---the `fit` method must access all the data and the `transform` method processes each part of the dataset one at a time.
+
+```python
+# Typical signature of a fit method.
+def fit(
+    self,
+    selector: ColumnSelector,
+    dataset: Dataset
+) -> Any: ...
+
+# Typical signature of a transform method.
+def transform(
+    self,
+    selector: ColumnSelector,
+    df: DataFrame
+) -> DataFrame: ...
+```
+
+## Operators and Columns: Column Selector
+
+In most cases, you want an Operator to process a subset of the columns in your input dataset.
+Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on.
+Merlin uses a `ColumnSelector` class to represent the columns.
+
+The simplest column selector is a list of strings that specify some column names.
+In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class.
+
+```python
+result = ["col1", "col2"] >> SomeOperator(...)
+```
+
+Column selectors also offer a more powerful and flexible way to specify columns.
+You can specify the input columns to an Operator with tags.
+In the following sample code, the Operator processes all the continuous variables in a dataset.
+
+```python
+result = [Tags.CONTINUOUS] >> SomeOperator(...)
+```
+
+Using tags to create a column selector offers the following advantages:
+
+- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables.
+- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset.
+- Simplifies code by avoiding lists of strings for column names.
+
+## How to Build an Operator
+
+Blah.
+
+## Reference Documentation
+
+- {py:class}`merlin.dag.BaseOperator`
+- {py:class}`merlin.dag.ColumnSelector`
+- {py:class}`merlin.schema.Tags`
+- {py:class}`merlin.io.Dataset`
\ No newline at end of file
diff --git a/docs/source/about-schema.md b/docs/source/about-schema.md
new file mode 100644
index 000000000..aac112ae9
--- /dev/null
+++ b/docs/source/about-schema.md
@@ -0,0 +1,3 @@
+# About the Merlin Schema
+
+FIXME
\ No newline at end of file
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 779dc587d..a51bd04eb 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -118,6 +118,14 @@

 autosummary_generate = True

+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    "merlin-core": ("https://nvidia-merlin.github.io/core/main", None),
+    "merlin-systems": ("https://nvidia-merlin.github.io/systems/main", None),
+    "merlin-models": ("https://nvidia-merlin.github.io/models/main", None),
+    "NVTabular": ("https://nvidia-merlin.github.io/NVTabular/main", None),
+}
+
 copydirs_additional_dirs = ["../../examples/", "../../README.md"]

 copydirs_file_rename = {
diff --git a/docs/source/technical-concepts.md b/docs/source/technical-concepts.md
new file mode 100644
index 000000000..290d1da7f
--- /dev/null
+++ b/docs/source/technical-concepts.md
@@ -0,0 +1,4 @@
+# Merlin Technical Concepts
+
+The following pages provide a deeper technical understanding of Merlin concepts.
+These concepts can help you to develop your own operator to implement a more sophisticated recommender system.
\ No newline at end of file
diff --git a/docs/source/toc.yaml b/docs/source/toc.yaml
index aa384075a..bcc2d1949 100644
--- a/docs/source/toc.yaml
+++ b/docs/source/toc.yaml
@@ -46,5 +46,13 @@ subtrees:
         title: Deploy the HugeCTR Model with Triton
       - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
         title: Deploy the TensorFlow Model with Triton
+  - title: Merlin Technical Concepts
+    file: technical-concepts.md
+    entries:
+      - file: about-dag.md
+        title: Graph Concepts
+      - file: about-schema.md
+      - file: about-operators.md
+      - file: about-model-blocks.md
   - file: containers.rst
   - file: support_matrix/index.rst
\ No newline at end of file
diff --git a/requirements/docs.txt b/requirements/docs.txt
new file mode 100644
index 000000000..e5229f453
--- /dev/null
+++ b/requirements/docs.txt
@@ -0,0 +1,22 @@
+# docs
+ipython==8.2.0
+Sphinx==3.5.4
+jinja2<3.1
+markupsafe==2.0.1
+natsort==8.1.0
+sphinx_rtd_theme
+sphinx_markdown_tables
+sphinx-multiversion@git+https://github.com/mikemckiernan/sphinx-multiversion.git@v0.3.0
+sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/sphinxcontrib-copydirs.git@v0.3.3
+sphinx-external-toc<0.4
+myst-nb
+linkify-it-py
+Markdown==3.3.7
+
+# smx
+mergedeep<1.4
+docker<5.1
+PyGithub<1.56
+semver>=2,<3
+pytest<7.3
+coverage<6.6
diff --git a/tox.ini b/tox.ini
index 80406dc4f..5a76a130c 100644
--- a/tox.ini
+++ b/tox.ini
@@ -36,14 +36,14 @@ commands =
 ; Generates documentation with sphinx. There are other steps in the Github Actions workflow
 ; to publish the documentation on release.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
-    python -m sphinx.cmd.build -P -b html docs/source docs/build/html
+    python -m sphinx.cmd.build -P -b {posargs:html} docs/source docs/build/{posargs:html}

 [testenv:docs-multi]
 ; Run the multi-version build that is shown on GitHub Pages.
 changedir = {toxinidir}
-deps = -rrequirements/docs.txt
+deps = -r requirements/docs.txt
 commands =
     sphinx-multiversion --dump-metadata docs/source docs/build/html | jq "keys"
     sphinx-multiversion docs/source docs/build/html