Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
- Karl's info about Dataset.
1 parent
5388a1d
commit d9a832e
Showing 15 changed files with 2,837 additions and 6 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,5 +18,4 @@ mergedeep<1.4 | |
docker<5.1 | ||
PyGithub<1.56 | ||
semver>=2,<3 | ||
pytest<7.3 | ||
coverage<6.6 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# About the Merlin Graph | ||
|
||
```{contents} | ||
--- | ||
depth: 2 | ||
local: true | ||
backlinks: none | ||
--- | ||
``` | ||
|
||
## Purpose of the Merlin Graph | ||
|
||
Merlin uses a directed acyclic graph (DAG) to represent operations on data such as normalizing or clipping values and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference. | ||
|
||
Understanding the Merlin DAG is helpful if you want to develop your own Operator or build a recommender system with Merlin. | ||
|
||
## Graph Terminology | ||
|
||
node | ||
: A node in the DAG is a group of columns and at least one _operator_. | ||
The columns are specified with a _column selector_. | ||
A node has an _input schema_ and an _output schema_. | ||
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset. | ||
|
||
column selector | ||
: A column selector specifies the columns to select from a dataset using column names or _tags_. | ||
|
||
operator | ||
: An operator performs a transformation on data and returns a new _node_. | ||
The data is identified by the _column selector_. | ||
Some simple operators like `+` and `-` add or remove columns. | ||
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation. | ||
|
||
schema | ||
: A Merlin schema is metadata that describes the columns in a dataset. | ||
Each column has its own schema that identifies the column name and can specify _tags_ and properties. | ||
|
||
tag | ||
: A Merlin tag categorizes information about a column. | ||
Adding a tag to a column enables you to select columns for operations by tag rather than name. | ||
|
||
For example, you can add the `CONTINUOUS` or `CATEGORICAL` tags to columns. | ||
Feature engineering Operators, modeling, and inference operations can use that information to operate accordingly on the dataset. | ||
|
||
## Introduction to Operators, Columns, Nodes, and Schema | ||
|
||
The NVTabular library uses Operators for feature engineering. | ||
One example of an NVTabular Operator is `Normalize`. | ||
The Operator standardizes continuous variables so that each has a mean of `0` and a standard deviation of `1`. | ||
|
||
The Merlin Systems library uses Operators for building ensembles and performing inference. | ||
The library includes Operators such as `FilterCandidates` and `PredictTensorflow`. | ||
You use these Operators to put your models into production and serve recommendations. | ||
|
||
Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows. | ||
The `>>` syntax means "take the output columns from the left-hand side and feed them as the input columns to the right-hand side." | ||
|
||
You can specify an explicit list of column names for an Operator. | ||
The following code block shows the syntax for explicit column names: | ||
|
||
```python | ||
result = ["col1", "col2",] >> SomeOperator(...) | ||
``` | ||
|
||
Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator: | ||
|
||
```python | ||
result = AnOperator(...) >> OtherOperator(...) | ||
``` | ||
|
||
Chaining Operators together builds a graph. | ||
The following figure shows how each node in the graph has an Operator. | ||
|
||
![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg) | ||
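
As a rough illustration of how `>>` chaining can build a graph, the following sketch uses Python's `__rshift__` and `__rrshift__` hooks. The names `ToyNode` and `ToyOperator` are hypothetical and exist only for this example. They are not the NVTabular or Merlin Systems classes, and the real libraries may implement chaining differently.

```python
# Toy sketch only: ToyNode and ToyOperator are hypothetical names,
# not classes from NVTabular or Merlin Systems.
class ToyNode:
    """One node in the graph: a label plus the nodes that feed it."""

    def __init__(self, label, parents=()):
        self.label = label
        self.parents = list(parents)

    def __rshift__(self, operator):
        # node >> operator: the operator becomes a new child node.
        return ToyNode(operator.name, parents=[self])

    def chain(self):
        """Return the labels from the root selection to this node."""
        labels = []
        node = self
        while node is not None:
            labels.append(node.label)
            node = node.parents[0] if node.parents else None
        return list(reversed(labels))


class ToyOperator:
    def __init__(self, name):
        self.name = name

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> operator: a plain list has no matching
        # __rshift__, so Python calls this reflected method instead.
        selection = ToyNode(f"select{list(columns)}")
        return selection >> self


graph = ["col1", "col2"] >> ToyOperator("SomeOperator") >> ToyOperator("OtherOperator")
print(graph.chain())
```

In this sketch, a chain must start with a list of column names. Each `>>` adds one node whose parent is the node on its left, which mirrors the selection node followed by an Operator node in the figure.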
|
||
```{tip} | ||
After you build an NVTabular workflow or Merlin Systems transform workflow, you can visualize the graph and create an image like the preceding example by running the `graph` method. | ||
``` | ||
|
||
Each node in a graph has an input schema and an output schema that describe the input columns to the Operator and the output columns produced by the Operator. | ||
The following figure represents an Operator, `SomeOperator`, that adds `colB` to a dataset. | ||
|
||
![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg) | ||
|
||
In practice, when Merlin first builds the graph, the workflow does not yet know which columns are input or output. | ||
This is for two reasons: | ||
|
||
1. Merlin enables you to build graphs that process categories of columns. | ||
The categories are specified by _tags_ instead of an explicit list of column names. | ||
|
||
For example, you can select the continuous columns from your dataset with code like the following example: | ||
|
||
```python | ||
[Tags.CONTINUOUS] >> Operator(...) | ||
``` | ||
|
||
1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset. | ||
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names. | ||
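
The two reasons above can be illustrated with a small sketch of deferred resolution. It assumes a schema that is a plain dictionary from column name to a set of tag strings, and a hypothetical helper named `resolve_selector`. Neither is part of the Merlin API. The `tag:` prefix is also an invention for this example.

```python
# Toy sketch only: the schema is a plain dict and resolve_selector is a
# hypothetical helper, not part of the Merlin API.
def resolve_selector(selector, schema):
    """Turn a selector of column names and tags into concrete column names.

    selector: list of strings; entries that start with "tag:" select by tag.
    schema: dict that maps each column name to a set of tag strings.
    """
    selected = []
    for entry in selector:
        if entry.startswith("tag:"):
            tag = entry[len("tag:"):]
            matches = [name for name, tags in schema.items() if tag in tags]
        else:
            matches = [entry] if entry in schema else []
        for name in matches:
            if name not in selected:
                selected.append(name)
    return selected


# The selector is written before any dataset exists.
selector = ["tag:continuous"]

# Only when a dataset (and therefore a schema) is available can the
# selector be resolved to column names.
schema = {
    "user_id": {"categorical"},
    "price": {"continuous"},
    "age": {"continuous"},
}
print(resolve_selector(selector, schema))
```

The same selector resolves to different columns for a different schema, which is why the column names cannot be known when the graph is first built.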
|
||
## Reference Documentation | ||
|
||
- {py:class}`nvtabular.ops.Normalize` | ||
- {py:class}`nvtabular.workflow.workflow.Workflow` | ||
- {py:class}`merlin.systems.dag.ops.workflow.TransformWorkflow` | ||
- {py:class}`merlin.systems.dag.Ensemble` | ||
- {py:class}`merlin.systems.dag.ops.session_filter.FilterCandidates` | ||
- {py:class}`merlin.systems.dag.ops.tensorflow.PredictTensorflow` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# About the Merlin Dataset | ||
|
||
```{contents} | ||
--- | ||
depth: 2 | ||
local: true | ||
backlinks: none | ||
--- | ||
``` | ||
|
||
## On-disk Representation | ||
|
||
The Apache Parquet file format is the most frequently used file format for Merlin datasets. | ||
|
||
Parquet is a columnar storage format. | ||
The format arranges the values for each column in a long list. | ||
This format is in contrast with a row-oriented format---such as a comma-separated values format---that arranges all the data for one row together. | ||
|
||
As an analogy, columnar storage is like a dictionary of columns, whereas row-oriented storage is like a list of rows. | ||
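
The analogy can be written out with plain Python containers. This is only an illustration of the two layouts with made-up sample values. It does not read or write Parquet.

```python
# Row-oriented: a list of rows, each row holds every field for one record.
rows = [
    {"user_id": 1, "price": 9.5},
    {"user_id": 2, "price": 3.0},
    {"user_id": 3, "price": 7.25},
]

# Columnar: a dictionary of columns, each column holds one field for all records.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading one column touches a single list in the columnar layout ...
prices_columnar = columns["price"]
# ... but has to visit every row in the row-oriented layout.
prices_rows = [row["price"] for row in rows]

print(columns)
print(prices_columnar == prices_rows)
```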
|
||
In most cases, a Parquet dataset includes multiple files in one or more directories. | ||
|
||
![The Merlin dataset class can read a directory of Parquet files for data access.](../images/parquet_and_dataset.svg) | ||
|
||
The Merlin dataset class, `merlin.io.Dataset`, treats a collection of many Parquet files as a single dataset. | ||
By treating the collection as a single dataset, Merlin simplifies distributing computation over multiple GPUs or multiple machines. | ||
|
||
The dataset class is not a copy of the data or a modification of the Parquet files. | ||
An instance of the class is similar to a collection of pointers to the Parquet files. | ||
|
||
When you create an instance of the dataset class, Merlin attempts to infer a schema by reading one record of the data. | ||
Merlin attempts to determine the column names and data types. | ||
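
The idea of inferring a schema from a single record can be sketched as follows. The function `infer_schema` and the sample record are hypothetical, and the sketch makes no claim about how `merlin.io.Dataset` performs the inference internally.

```python
# Toy sketch only: infer_schema is a hypothetical helper, not a Merlin function.
def infer_schema(record):
    """Map each column name to the Python type name of its value."""
    return {name: type(value).__name__ for name, value in record.items()}


# One record is enough to learn the column names and a type for each column.
first_record = {"user_id": 42, "price": 9.5, "category": "books"}
print(infer_schema(first_record))
```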
|
||
## Processing Data: Dataset and DataFrame | ||
|
||
When you perform a computation on a Merlin dataset, the dataset reads from the files on disk and converts them into a set of DataFrames. | ||
The DataFrames, like Parquet files, use a columnar storage format. | ||
The API for a DataFrame is similar to a Python dictionary---you can reference a column with syntax like `dataframe['col1']`. | ||
|
||
![A Merlin dataset reads data from disk and becomes several DataFrames.](../images/dataset_and_dataframe.svg) | ||
|
||
Merlin processes each DataFrame individually and aggregates the results across the DataFrames as needed. | ||
There are two kinds of computations that you can perform on a dataset: `fit` and `transform`. | ||
|
||
The `fit` computations perform a full pass over the dataset to compute statistics, find unique values, perform grouping, or another operation that requires information from multiple DataFrames. | ||
|
||
The `transform` computations process each DataFrame individually. | ||
These computations use the information gathered from `fit` to alter the DataFrame. | ||
For example, the `Normalize` and `Clip` Operators compute new values for columns, and the `Rename` Operator adds and removes columns. | ||
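
The split between a full-pass `fit` and a per-DataFrame `transform` can be sketched with plain Python. In this sketch, each DataFrame is stood in for by a dictionary of column lists, and `ToyStandardize` is a hypothetical operator that rescales a column to zero mean and unit standard deviation. It is not an NVTabular class and assumes the column is not constant.

```python
import statistics


# Toy sketch only: each "DataFrame" is a dict of column lists, and
# ToyStandardize is a hypothetical operator, not an NVTabular class.
class ToyStandardize:
    def __init__(self, column):
        self.column = column
        self.mean = None
        self.std = None

    def fit(self, partitions):
        # fit needs a full pass: gather the column from every partition.
        values = [v for part in partitions for v in part[self.column]]
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)

    def transform(self, partition):
        # transform handles one partition at a time with the fitted statistics.
        scaled = [(v - self.mean) / self.std for v in partition[self.column]]
        return {**partition, self.column: scaled}


partitions = [
    {"price": [1.0, 2.0]},
    {"price": [3.0, 4.0, 5.0]},
]

op = ToyStandardize("price")
op.fit(partitions)
transformed = [op.transform(part) for part in partitions]
print(op.mean, transformed[0]["price"])
```

Because the statistics come from all partitions, `transform` gives consistent results no matter which partition it is applied to.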
|
||
More information about the `fit` and `transform` methods is provided in [](./about-operators.md). | ||
|
||
## Reference Documentation | ||
|
||
- {py:class}`merlin.io.Dataset` | ||
- {py:class}`nvtabular.ops.Normalize` | ||
- {py:class}`nvtabular.ops.Clip` | ||
- {py:class}`nvtabular.ops.Rename` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# About Merlin Model Blocks | ||
|
||
FIXME |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
# About Merlin Operators | ||
|
||
```{contents} | ||
--- | ||
depth: 2 | ||
local: true | ||
backlinks: none | ||
--- | ||
``` | ||
|
||
## Understanding Operators | ||
|
||
Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on. | ||
|
||
An Operator implements two key methods: | ||
|
||
Fit | ||
: The `fit` method performs any pre-computation steps that are required before modifying the data. | ||
|
||
For example, the `Normalize` Operator standardizes the values of a continuous variable to a mean of `0` and a standard deviation of `1`. | ||
The `fit` method determines the mean and standard deviation. | ||
|
||
The method is optional. | ||
For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping. | ||
These Operators do not need to access the data to perform any pre-computation steps. | ||
|
||
Transform | ||
: The `transform` method operates on the data, for example by normalizing values, bucketing, or clipping. | ||
This method modifies the data. | ||
|
||
Another difference between the two methods is that the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object. | ||
The difference is an implementation detail---the `fit` method must access all the data and the `transform` method processes each part of the dataset one at a time. | ||
|
||
```{code-block} python | ||
--- | ||
emphasize-lines: 5, 11 | ||
--- | ||
# Typical signature of a fit method. | ||
def fit( | ||
    self, | ||
    selector: ColumnSelector, | ||
    dataset: Dataset | ||
) -> Any: ... | ||
# Typical signature of a transform method. | ||
def transform( | ||
    self, | ||
    selector: ColumnSelector, | ||
    df: DataFrame | ||
) -> DataFrame: ... | ||
``` | ||
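
As a minimal sketch of an Operator that skips `fit`, the following example clips values between bounds that the caller supplies. `ToyClip` is hypothetical, the selector is a plain list of column names, and the DataFrame is stood in for by a dictionary of lists. The real `Clip` Operator and `ColumnSelector` class are not used here.

```python
# Toy sketch only: ToyClip is hypothetical, and a "DataFrame" is a dict of lists.
class ToyClip:
    def __init__(self, min_value, max_value):
        # Bounds come from the caller, so no fit step is needed.
        self.min_value = min_value
        self.max_value = max_value

    def transform(self, selector, df):
        """Return a copy of df with the selected columns clipped."""
        result = dict(df)
        for name in selector:
            result[name] = [
                min(max(value, self.min_value), self.max_value)
                for value in df[name]
            ]
        return result


df = {"price": [-5.0, 2.5, 40.0], "user_id": [1, 2, 3]}
clipped = ToyClip(min_value=0.0, max_value=10.0).transform(["price"], df)
print(clipped)
```

Only the columns named by the selector change. The other columns pass through untouched, and the input dictionary is left as it was.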
|
||
## Operators and Columns: Column Selector | ||
|
||
In most cases, you want an Operator to process a subset of the columns in your input dataset. | ||
Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on. | ||
Merlin uses a `ColumnSelector` class to represent the columns. | ||
|
||
The simplest column selector is a list of strings that specify some column names. | ||
In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class. | ||
|
||
```python | ||
result = ["col1", "col2"] >> SomeOperator(...) | ||
``` | ||
|
||
Column selectors also offer a more powerful and flexible way to specify columns. | ||
You can specify the input columns to an Operator with tags. | ||
In the following sample code, the Operator processes all the continuous variables in a dataset. | ||
|
||
```python | ||
result = [Tags.CONTINUOUS] >> SomeOperator(...) | ||
``` | ||
|
||
Using tags to create a column selector offers the following advantages: | ||
|
||
- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables. | ||
- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset. | ||
- Simplifies code by avoiding lists of strings for column names. | ||
|
||
## How to Build an Operator | ||
|
||
Blah. | ||
|
||
## Reference Documentation | ||
|
||
- {py:class}`merlin.dag.BaseOperator` | ||
- {py:class}`merlin.dag.ColumnSelector` | ||
- {py:class}`merlin.schema.Tags` | ||
- {py:class}`merlin.io.Dataset` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# About the Merlin Schema | ||
|
||
FIXME |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Merlin Technical Concepts | ||
|
||
The following pages provide a deeper technical understanding of Merlin concepts. | ||
These concepts can help you to develop your own operator to implement a more sophisticated recommender system. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# docs | ||
ipython==8.2.0 | ||
Sphinx==3.5.4 | ||
jinja2<3.1 | ||
markupsafe==2.0.1 | ||
natsort==8.1.0 | ||
sphinx_rtd_theme | ||
sphinx_markdown_tables | ||
sphinx-multiversion@git+https://github.com/mikemckiernan/[email protected] | ||
sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/[email protected] | ||
sphinx-external-toc<0.4 | ||
myst-nb | ||
linkify-it-py | ||
Markdown==3.3.7 | ||
|
||
# smx | ||
mergedeep<1.4 | ||
docker<5.1 | ||
PyGithub<1.56 | ||
semver>=2,<3 | ||
pytest<7.3 | ||
coverage<6.6 |