Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add information about the Merlin DAG
- Define the important terms of the DAG.
- Incorporate Karl's information.
- Karl's info about Operators and ColumnSelectors.
1 parent
c2f9519
commit ef04422
Showing 12 changed files with 1,000 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,5 +18,4 @@ mergedeep<1.4 | |
docker<5.1 | ||
PyGithub<1.56 | ||
semver>=2,<3 | ||
pytest<7.3 | ||
coverage<6.6 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
# About the Merlin Directed Acyclic Graph | ||
|
||
```{contents} | ||
--- | ||
depth: 2 | ||
local: true | ||
backlinks: none | ||
--- | ||
``` | ||
|
||
Merlin uses a directed acyclic graph (DAG) to represent operations on data such as filtering or bucketing and to represent operations in a recommender system such as creating an ensemble or filtering candidate items during inference. | ||
|
||
Understanding the Merlin DAG is helpful if you want to develop your own operator (Op) or build a recommender system with Merlin. | ||
|
||
## Graph Terminology | ||
|
||
node | ||
: A node in the DAG is a group of columns and at least one _operator_. | ||
The columns are specified with a _column selector_. | ||
A node has an _input schema_ and an _output schema_. | ||
Resolution of the schemas is delayed until you run `fit` or `transform` on a dataset. | ||
|
||
column selector | ||
: A column selector specifies the columns to select from a dataset using column names or _tags_. | ||
|
||
operator | ||
: An operator performs a transformation on data and returns a new _node_. | ||
The data is identified by the _column selector_. | ||
Some simple operators like `+` and `-` add or remove columns. | ||
More complex operations are applied by shifting the operators onto the column selector with the `>>` notation. | ||
|
||
schema | ||
: A Merlin schema is metadata that describes the columns in a dataset. | ||
Each column has its own schema that identifies the column name and can specify _tags_ and properties. | ||
|
||
tag | ||
: A Merlin tag categorizes information about a column. | ||
Adding a tag to a column enables you to select columns for operations by tag rather than name. | ||
|
||
For example, you can add the `USER` and `ITEM` tags to columns. | ||
Modeling and inference operations can use that information to act accordingly on the dataset. | ||
|
||
## Understanding Operators, Columns, Nodes, and Schema | ||
|
||
Merlin enables you to chain together Operators with the `>>` syntax to create feature-processing workflows. | ||
The `>>` syntax means "take the output of the left-hand side and feed it into the input of the right-hand side." | ||
|
||
You can specify an explicit list of column names to run an Operator on only those columns. | ||
The following code block shows the syntax for explicit column names: | ||
|
||
```python | ||
result = ["col1", "col2"] >> SomeOperator(...) | ||
``` | ||
|
||
Or, you can use the `>>` syntax between Operators to run one Operator on all the output columns from the preceding Operator: | ||
|
||
```python | ||
result = AnOperator(...) >> OtherOperator(...) | ||
``` | ||
|
||
Chaining Operators together builds a graph. | ||
The following figure shows how each node in the graph has an Operator. | ||
|
||
![A directed graph with two nodes. The first node is a Selection Operator and selects columns "col1" and "col2." The second node receives the two columns as its input. The second node has a fictional SomeOperator Operator.](../images/graph_simple.svg) | ||
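
To make the chaining behavior concrete, the following is a minimal, self-contained sketch of how `>>` can build a graph from a list of column names and a sequence of operators. It is a toy model in plain Python, not the Merlin implementation: `ToySelector`, `ToyNode`, and `ToyOperator` are hypothetical names introduced only for illustration.

```python
# Toy model of ">>" chaining. These classes are hypothetical and are not part of Merlin.
class ToySelector:
    """Holds the explicit column names that start a graph."""

    def __init__(self, names):
        self.names = list(names)


class ToyNode:
    """One graph node: a selector or operator plus a link to the upstream node."""

    def __init__(self, op, parent=None):
        self.op = op
        self.parent = parent

    def __rshift__(self, other):
        # node >> operator: the operator consumes this node's output.
        return ToyNode(other, parent=self)

    def label(self):
        if isinstance(self.op, ToySelector):
            return "Selection(" + ", ".join(self.op.names) + ")"
        return self.op.name

    def chain(self):
        # Walk back to the root, then report the nodes in execution order.
        labels, node = [], self
        while node is not None:
            labels.append(node.label())
            node = node.parent
        return list(reversed(labels))


class ToyOperator:
    def __init__(self, name):
        self.name = name

    def __rrshift__(self, columns):
        # ["col1", "col2"] >> op: a list cannot handle ">>" itself,
        # so Python falls back to the right-hand operand.
        return ToyNode(ToySelector(columns)) >> self

    def __rshift__(self, other):
        # op >> other_op: wrap this operator in a node first.
        return ToyNode(self) >> other


graph = ["col1", "col2"] >> ToyOperator("SomeOperator") >> ToyOperator("OtherOperator")
print(graph.chain())
```

In this sketch, the list of names becomes a selection node at the root of the graph, which mirrors the two-node figure above with one extra operator appended.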
|
||
Each node in a graph has an input schema and an output schema that describe the columns that go into an Operator and the columns that go out of an Operator. | ||
The following figure represents an Operator that adds `colB` to a dataset. | ||
|
||
![Part of a directed graph that shows the input schema to a fictional SomeOperator Operator as "colA". The fictional Operator adds "colB" and the result is an output schema with "colA" and "colB."](../images/graph_schema.svg) | ||
|
||
In practice, when Merlin builds the graph, the workflow does not yet know which columns are processed or produced. | ||
This is for two reasons: | ||
|
||
1. Merlin enables you to build graphs that process categories of columns. | ||
The categories are specified by _tags_ instead of an explicit list of column names. | ||
|
||
For example, you can select the continuous columns from your dataset with code like the following example: | ||
|
||
```python | ||
[Tags.CONTINUOUS] >> Operator(...) | ||
``` | ||
|
||
1. You can chain Operators together into a graph, such as an NVTabular workflow, before you specify a dataset. | ||
The graph, Operators, and schema do not know which columns will be selected by tag until the software accesses the dataset and determines the column names. |
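
The two reasons above can be illustrated with a small sketch of deferred resolution. This is a toy model under stated assumptions, not Merlin code: `ToyTag` and `ToySelector` are hypothetical names, and the schema is represented as a plain dictionary that maps each column name to a set of tags.

```python
# Toy model of deferred column resolution. These names are hypothetical, not Merlin APIs.
from enum import Enum


class ToyTag(Enum):
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


class ToySelector:
    """Stores names and tags; the concrete columns are unknown until resolve() runs."""

    def __init__(self, names=(), tags=()):
        self.names = list(names)
        self.tags = list(tags)

    def resolve(self, schema):
        # schema maps each column name to the set of tags on that column.
        selected = [name for name in self.names if name in schema]
        for name, column_tags in schema.items():
            if name not in selected and any(tag in column_tags for tag in self.tags):
                selected.append(name)
        return selected


# The selector is created before any dataset or schema is available.
selector = ToySelector(tags=[ToyTag.CONTINUOUS])

schema_v1 = {"age": {ToyTag.CONTINUOUS}, "city": {ToyTag.CATEGORICAL}}
schema_v2 = {**schema_v1, "income": {ToyTag.CONTINUOUS}}

columns_v1 = selector.resolve(schema_v1)
columns_v2 = selector.resolve(schema_v2)
print(columns_v1, columns_v2)
```

The same selector yields a different column list for each schema, which is why the column names cannot be fixed when the graph is built.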
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# About Merlin Model Blocks | ||
|
||
FIXME |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
# About Merlin Operators | ||
|
||
```{contents} | ||
--- | ||
depth: 2 | ||
local: true | ||
backlinks: none | ||
--- | ||
``` | ||
|
||
## Understanding Operators | ||
|
||
Merlin uses Operators to perform computation on datasets such as normalizing continuous variables, bucketing continuous variables, clipping variables between minimum and maximum values, and so on. | ||
|
||
An Operator implements two key methods: | ||
|
||
Fit | ||
: The `fit` method performs any pre-computation steps that are required before operating on data. | ||
|
||
For example, the `NormalizeMinMax` Operator scales the values of a continuous column between 0 and 1. | ||
The `fit` method determines the minimum and maximum values. | ||
|
||
The method is optional. | ||
For example, the `Bucketize` and `Clip` Operators do not implement the method because you specify the bucket boundaries or the minimum and maximum values for clipping. | ||
These Operators do not need to access the data to perform any pre-computation steps. | ||
|
||
Transform | ||
: The `transform` method applies the operation to the data, such as normalizing values, bucketing, or clipping. | ||
|
||
Another difference between the two methods is that the `fit` method accepts a Merlin dataset object and the `transform` method accepts a DataFrame object. | ||
The difference is an implementation detail---the `fit` method must access all the data and the `transform` method processes each part of the dataset one at a time. | ||
|
||
```python | ||
# Typical signature of a fit method. | ||
def fit( | ||
    self, | ||
    selector: ColumnSelector, | ||
    dataset: Dataset, | ||
) -> Any: ... | ||
|
||
# Typical signature of a transform method. | ||
def transform( | ||
    self, | ||
    selector: ColumnSelector, | ||
    df: DataFrame, | ||
) -> DataFrame: ... | ||
``` | ||
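
As a rough illustration of the split between the two methods, the following sketch fits a min-max scaler on all partitions and then transforms one partition at a time. It is a toy under stated assumptions, not a Merlin Operator: `ToyMinMax` is a hypothetical class, the column list stands in for a column selector, and each partition is a plain dictionary of lists rather than a DataFrame.

```python
# Toy min-max scaler. ToyMinMax is hypothetical; partitions are dicts of lists, not DataFrames.
class ToyMinMax:
    def __init__(self):
        self.stats = {}

    def fit(self, columns, partitions):
        # fit must see every partition because the minimum and maximum are global statistics.
        for col in columns:
            values = [v for part in partitions for v in part[col]]
            self.stats[col] = (min(values), max(values))
        return self

    def transform(self, columns, partition):
        # transform handles one partition at a time and reuses the fitted statistics.
        out = dict(partition)
        for col in columns:
            low, high = self.stats[col]
            span = (high - low) or 1.0  # avoid dividing by zero for constant columns
            out[col] = [(v - low) / span for v in partition[col]]
        return out


partitions = [{"price": [10.0, 20.0]}, {"price": [30.0, 50.0]}]
op = ToyMinMax().fit(["price"], partitions)
scaled = [op.transform(["price"], part) for part in partitions]
print(scaled)
```

An operation such as clipping to fixed bounds would need only the `transform` step, because its parameters are supplied by the caller rather than computed from the data.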
|
||
## Operators and Columns: Column Selector | ||
|
||
In most cases, you want an Operator to process a subset of the columns in your input dataset. | ||
Both the `fit` and `transform` methods have a `selector` argument that specifies the columns to operate on. | ||
Merlin uses a `ColumnSelector` class to represent the columns. | ||
|
||
The simplest column selector is a list of strings that specify some column names. | ||
In the following sample code, `["col1", "col2"]` becomes an instance of a `ColumnSelector` class. | ||
|
||
```python | ||
result = ["col1", "col2"] >> SomeOperator(...) | ||
``` | ||
|
||
Column selectors also offer a more powerful and flexible way to specify columns. | ||
You can specify the input columns to an Operator with tags. | ||
In the following sample code, the Operator processes all the continuous variables in a dataset. | ||
|
||
```python | ||
result = [Tags.CONTINUOUS] >> SomeOperator(...) | ||
``` | ||
|
||
Using tags to create a column selector offers the following advantages: | ||
|
||
- Enables you to apply several Operators to the same kind of columns, such as categorical or continuous variables. | ||
- Reduces code maintenance by enabling your code to automatically operate on newly added columns in a dataset. | ||
- Simplifies code by avoiding lists of strings for column names. | ||
|
||
## How to Build an Operator | ||
|
||
Blah. | ||
|
||
## Reference Documentation | ||
|
||
- {py:class}`merlin.dag.BaseOperator` | ||
- {py:class}`merlin.dag.ColumnSelector` | ||
- {py:class}`merlin.schema.Tags` | ||
- {py:class}`merlin.io.Dataset` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# About the Merlin Schema | ||
|
||
FIXME |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Merlin Technical Concepts | ||
|
||
The following pages provide a deeper technical understanding of Merlin concepts. | ||
These concepts can help you to develop your own operator to implement a more sophisticated recommender system. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# docs | ||
ipython==8.2.0 | ||
Sphinx==3.5.4 | ||
jinja2<3.1 | ||
markupsafe==2.0.1 | ||
natsort==8.1.0 | ||
sphinx_rtd_theme | ||
sphinx_markdown_tables | ||
sphinx-multiversion@git+https://github.com/mikemckiernan/[email protected] | ||
sphinxcontrib-copydirs@git+https://github.com/mikemckiernan/[email protected] | ||
sphinx-external-toc<0.4 | ||
myst-nb | ||
linkify-it-py | ||
Markdown==3.3.7 | ||
|
||
# smx | ||
mergedeep<1.4 | ||
docker<5.1 | ||
PyGithub<1.56 | ||
semver>=2,<3 | ||
pytest<7.3 | ||
coverage<6.6 |