Updating Docs (#46)
rsamf authored Jul 24, 2024
1 parent 244091e commit 173f9b3
Showing 16 changed files with 338 additions and 53 deletions.
53 changes: 42 additions & 11 deletions README.md
@@ -1,31 +1,62 @@
<p align="center">
<a href="https://graphbook.ai">
<img src="docs/_static/graphbook.png" alt="Logo" width=256>
</a>

<h1 align="center">Graphbook</h1>

<p align="center">
The ML workflow framework
<br>
<a href="https://github.com/graphbookai/graphbook/issues/new?template=bug.md">Report bug</a>
·
<a href="https://github.com/graphbookai/graphbook/issues/new?template=feature.md&labels=feature">Request feature</a>
</p>

<p align="center">
<a href="#overview">Overview</a> •
<a href="#current-features">Current Features</a> •
<a href="#getting-started">Getting Started</a> •
<a href="#collaboration">Collaboration</a>
</p>
</p>

## Overview
Graphbook is a framework for building efficient, visual, DAG-structured ML workflows composed of nodes written in Python. It provides common ML processing features such as multiprocessing I/O and automatic batching, and it features a web-based UI to assemble, monitor, and execute data processing workflows. Graphbook can be used to prepare training data for custom ML models, experiment with custom-trained or off-the-shelf models, and build ML-based ETL applications. Custom nodes are written in Python, and Graphbook behaves like a framework, calling lifecycle methods on those nodes.

## Current Features
- Graph-based visual editor to experiment and create complex ML workflows
- Caches outputs and only re-executes the parts of the workflow that change between executions
- UI monitoring components for logs and outputs per node
- Custom nodes buildable in Python
- Automatic batching for PyTorch tensors
- Multiprocessing I/O to and from disk and network
- Customizable multiprocessing functions
- Ability to execute entire graphs, or individual subgraphs/nodes
- Ability to execute singular batches of data
- Ability to pause graph execution
- Basic nodes for filtering, loading, and saving outputs
- Node grouping and subflows
- Autosaving and shareable serialized workflow files
- Registers node code changes without needing a restart
- Monitorable CPU and GPU resource usage

## Getting Started
### Install from PyPI
1. `pip install graphbook`
1. `graphbook`
1. Visit http://localhost:8007

### Install with Docker
1. Pull and run the Docker image:
```bash
docker run --rm -p 8005:8005 -p 8006:8006 -p 8007:8007 -v $PWD/workflows:/app/workflows rsamf/graphbook:latest
```
1. Visit http://localhost:8007

Visit the [docs](https://docs.graphbook.ai) to learn more about how to create custom nodes and workflows with Graphbook.

## Collaboration
This section explains how to get started developing Graphbook. If you are simply using Graphbook, see the [Getting Started](#getting-started) section.

### Run Graphbook in Development Mode
Binary file added docs/_static/1_first_step.png
Binary file added docs/_static/2_first_workflow.png
Binary file added docs/_static/graphbook.png
32 changes: 32 additions & 0 deletions docs/concepts.rst
@@ -0,0 +1,32 @@
Concepts
########

Note
*****

Note is the formal name given to the unit of data that flows through a Graphbook workflow. A Note simply holds a dictionary which encapsulates information about the item being processed. For example, we can have Notes for all of the world's cars, where each Note stores a car's model, manufacturer, price, and an array of images of the car.
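
For illustration, here is a minimal sketch of constructing a Note from a plain dictionary, following the ``Note({...})`` usage shown in the Guides section; the field names are only an example.

.. code-block:: python

    from graphbook import Note

    # A Note wraps a dictionary describing a single item being processed.
    # The keys below are illustrative; a Note can hold whatever properties you need,
    # including arrays of image paths or tensors to be batched later.
    car = Note({
        "model": "Aurora GT",              # hypothetical model name
        "manufacturer": "Example Motors",  # hypothetical manufacturer
        "price": 42000,
        "images": ["cars/aurora_front.png", "cars/aurora_side.png"],
    })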

Step
*****

Step is one of the two fundamental node types in Graphbook and is a class meant to be extended in Python. Steps are the functional nodes that define the logic of a data processing pipeline. They are fed Notes as input and respond with zero or more Notes (or an array of Notes) at each of their output slots.

Resource
********

Resource is the second fundamental node type in Graphbook and is also an extendable class. It simply holds static information, or a Python object, that is fed as a parameter to another Resource or Step. A prime example of a Resource is a model. Tip: if a large object such as a model is used by multiple Steps in your workflow, it is best to reuse it by putting it in a Resource and passing the model to each Step as a parameter.

Step Lifecycle
**************

A Step goes through a series of lifecycle methods when processing a Note. The methods below are open for extension and are subject to change in future versions of Graphbook; a small sketch of a Step overriding a few of them follows the list.

#. ``__init__``: The constructor of the Step. This is where you can set up the Step and its parameters. This will not be re-called if a node's code or its parameter values have not changed.
#. ``on_start``: This is called at the start of each graph execution.
#. ``on_before_items``: This is called before the Step processes each Note.
#. ``on_item``: This is called for each item within the Note being processed.
#. ``on_item_batch``: This is called for each batch of items. *This only gets called if the Step is a BatchStep.*
#. ``on_after_items``: This is called after the Step processes each Note.
#. ``forward_note``: This is called after the Step processes each Note and is used to route the Note to a certain output slot.
#. ``on_end``: This is called at the end of each graph execution.
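
As a rough illustration, here is a sketch of a Step that overrides a few of these lifecycle methods. It is modeled on the ``MyFirstStep`` example in the Guides; the constructor signature and logger usage follow that example, while the class itself is hypothetical.

.. code-block:: python

    from graphbook.steps import Step
    from graphbook import Note

    class CountingStep(Step):
        RequiresInput = True
        Parameters = {}
        Outputs = ["out"]
        Category = "Custom"

        def __init__(self, id, logger):
            super().__init__(id, logger)
            self.seen = 0

        def on_start(self):
            # Called at the start of each graph execution; reset the counter.
            self.seen = 0

        def on_after_items(self, note: Note):
            # Called after each Note has been processed.
            self.seen += 1
            self.logger.log(f"Processed {self.seen} notes so far")

        def forward_note(self, note: Note) -> str:
            # Route every Note to the single "out" slot.
            return "out"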

9 changes: 8 additions & 1 deletion docs/contributing.rst
@@ -1,3 +1,10 @@
.. _repository: https://github.com/graphbookai/graphbook

.. _contributing:

Contributing
############

To learn how to run Graphbook in development mode, visit our GitHub repository_.

If you would like to contribute to the development of Graphbook, please fork the repo and submit your PR.
3 changes: 2 additions & 1 deletion docs/examples.rst
@@ -1,3 +1,4 @@
Examples
########

Coming soon.
18 changes: 0 additions & 18 deletions docs/getting_started.rst

This file was deleted.

105 changes: 105 additions & 0 deletions docs/guides.rst
@@ -0,0 +1,105 @@
Guides
###########

Naturally, the development lifecycle of a Graphbook workflow can be illustrated in a few simple steps:

#. **Build in Python**

Write processing nodes using Python in your favorite code editor

#. **Assemble in Graphbook**

Assemble an ML workflow in our graph-based editor with your own processing nodes

#. **Execute**

Run, monitor, and adjust parameters in your workflow

Build Your First ML Workflow
=============================
All of your custom nodes should be located inside a directory that was automatically created for you upon running ``graphbook``.
Graphbook watches that directory for any files ending with **.py** and will automatically load classes that inherit from ``Step`` or ``Resource`` as custom nodes.
Inside this directory, create a new Python file called ``my_first_nodes.py``, and add the below Step to it:

.. code-block:: python

    from graphbook.steps import Step
    from graphbook import Note
    import random

    class MyFirstStep(Step):
        RequiresInput = True
        Parameters = {
            "prob": {
                "type": "resource"
            }
        }
        Outputs = ["A", "B"]
        Category = "Custom"

        def __init__(self, id, logger, prob):
            super().__init__(id, logger)
            self.prob = prob

        def on_after_items(self, note: Note):
            self.logger.log(note['message'])

        def forward_note(self, note: Note) -> str:
            if random.random() < self.prob:
                return "A"
            return "B"

Go into the Graphbook UI and create a new workflow by adding a new .json file.
Then, right-click the pane, add a new Step node, and select ``MyFirstStep`` from the dropdown (Add Step > Custom > MyFirstStep).
Notice how its inputs and outputs are automatically populated.

.. image:: _static/1_first_step.png
:alt: First Step
:align: center

Try running the graph.
Notice how you get an error.
That's because you have no inputs and you haven't configured ``prob``.
Let's create a Source Step that generates fake data.

.. code-block:: python

    from graphbook.steps import SourceStep
    from graphbook import Note

    class MyFirstSource(SourceStep):
        RequiresInput = False
        Parameters = {
            "message": {
                "type": "string",
                "default": "Hello, World!"
            }
        }
        Outputs = ["message"]
        Category = "Custom"

        def __init__(self, id, logger, message):
            super().__init__(id, logger)
            self.message = message

        def load(self):
            return {
                "message": [Note({"message": self.message}) for _ in range(10)]
            }

        def forward_note(self, note: Note) -> str:
            return "message"

Add the new node to your workflow.
Connect the output slot "message" from ``MyFirstSource`` to the input slot "in" on ``MyFirstStep``.
Then, add a NumberResource to your workflow (Add Resource > Util > Number).
In its value widget, enter a number between 0 and 1 (e.g. 0.5).
Now run the graph again and observe the logs and outputs.

.. image:: _static/2_first_workflow.png
:alt: First Workflow
:align: center


Voila! You have successfully created your first workflow.

.. note::

More guides are coming soon!
97 changes: 93 additions & 4 deletions docs/index.rst
@@ -1,16 +1,105 @@
.. image:: _static/graphbook.png
:target: Graphbook_
:alt: Homepage
:width: 150px
:align: center

.. _Graphbook: https://www.graphbook.ai

Graphbook
##########

Graphbook is a framework for building efficient, visual, DAG-structured ML workflows composed of nodes written in Python. It provides common ML processing features such as multiprocessing I/O and automatic batching, and it features a web-based UI to assemble, monitor, and execute data processing workflows.

Features
*********

* Graph-based visual editor to experiment and create complex ML workflows
* Caches outputs and only re-executes the parts of the workflow that change between executions
* UI monitoring components for logs and outputs per node
* Custom nodes buildable in Python
* Automatic batching for PyTorch tensors
* Multiprocessing I/O to and from disk and network
* Customizable multiprocessing functions
* Ability to execute entire graphs, or individual subgraphs/nodes
* Ability to execute singular batches of data
* Ability to pause graph execution
* Basic nodes for filtering, loading, and saving outputs
* Node grouping and subflows
* Autosaving and shareable serialized workflow files
* Registers node code changes without needing a restart
* Monitorable CPU and GPU resource usage

Applications
*************

Graphbook can be used for a variety of applications, including the development of:

* ML-powered data processing or ETL pipelines
* Custom dataset curation tools
* Monitoring solutions for AI/ML workflows
* No-code ML development platforms
* Workflows for training/inference of custom ML models


Why
****

In developing ML workflows for AI applications, a number of problems arise.

1. Multiprocessing
====================

We want enough I/O throughput to keep our GPU from being underutilized. GPUs cost far more than CPUs, so it is important to scale the number of worker processes that load data from and dump data to disk and network.

Graphbook takes care of this by setting up worker processes that handle both loading and dumping. Additionally, users can monitor the performance of the workers, so they know when to scale up or down.
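
As a generic illustration of the idea (not Graphbook's internal API), the sketch below uses a plain Python process pool to keep image decoding ahead of GPU consumption; Graphbook manages equivalent worker processes for you and reports their performance in the UI.

.. code-block:: python

    from multiprocessing import Pool

    from PIL import Image  # assumes Pillow is available

    def load_image(path: str):
        # CPU-bound decode/resize that would starve the GPU if done serially.
        with Image.open(path) as img:
            return img.convert("RGB").resize((224, 224))

    if __name__ == "__main__":
        paths = [f"data/img_{i}.png" for i in range(1000)]  # hypothetical dataset

        # A handful of worker processes keep decoded images flowing to the GPU stage.
        with Pool(processes=4) as pool:
            for image in pool.imap_unordered(load_image, paths, chunksize=16):
                ...  # hand the image off to batching / GPU inference here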

2. Batching
=================

To take full advantage of the parallel processing capabilities of our GPU, we want to batch our units of data to maximize the efficiency of our workflows. Normally this is done with PyTorch's built-in Dataset and DataLoader classes, which handle both the multiprocessing and batching of tensors, but our input data is not just tensors. Our units of data, which we call Notes (essentially Python dicts), can carry additional properties such as attributes, annotations, and database IDs. While we process, we need to load and batch what goes onto our GPU while still knowing about the entirety of the Note.

Thus, Graphbook has a specific batching process that allows our nodes to load and batch along an individual property within a Note while still holding a reference to the Note it belongs to.
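
In pseudocode terms (this is a conceptual sketch, not Graphbook's actual implementation), batching along one property of each Note while keeping a handle on the source Notes looks roughly like this:

.. code-block:: python

    import torch
    from graphbook import Note

    def batch_property(notes: list[Note], key: str, batch_size: int):
        """Yield (tensor_batch, source_notes) pairs for one property of each Note.

        Illustrative only: assumes each note[key] is a torch.Tensor of equal shape.
        """
        for start in range(0, len(notes), batch_size):
            chunk = notes[start:start + batch_size]
            tensors = [note[key] for note in chunk]  # the property that goes to the GPU
            yield torch.stack(tensors), chunk        # keep references to the whole Notes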

3. Fast, Easy Iteration
=========================

Our workflows need to change quickly for many reasons, and our software engineers aren't the only ones who know how to fix things. We need an easy, accessible way to declare workflows and adjust their parameters, so that we can iterate quickly while giving anyone the opportunity to adjust the logic of our workflows.

Software engineers build modular nodes in Python by extending the behavior of Graphbook's processing nodes, and then anyone can assemble and manage those custom nodes through the no-code, web-based workflow editor.

4. Evaluation
================
It is difficult to estimate the performance of models in the wild. While a model is in production, it processes unlabeled data, and we are expected to trust its autonomous outputs. Furthermore, the output of generative AI is usually subjective; there is no straightforward way to accurately measure the fidelity or textual alignment of AI-generated images. Thus, we need a way to monitor the quality of our ML models by “peeking” at their outputs with human observation.

Through its web UI, Graphbook offers textual and visual views of the outputs and logs of each individual workflow node and allows the user to scroll through them.

5. Flexible Controls
======================
While experimenting with models and iterating on the development of our ML workflows, we want the workflow to be interactive and to give the user complete freedom over how graphs are executed. That means we need to be able to run single batches and observe the output without writing custom code to do so. When something seems to be going wrong, we want to be able to pause the current execution and diagnose the issue. When we want to focus on a particular subgraph (or a single node) within a much larger graph, we want to execute it without running or affecting the rest of the graph.

Graphbook supports pausing a currently executing workflow, executing a particular subgraph, and “stepping” through a node, which runs the node's parent dependencies and the node itself until it produces an output. It supports all of these controls without forcing the developer to write any additional code.

6. Caching
===========
If a node has executed before, we want to cache its results so that downstream nodes can use those previously generated results. This saves a lot of processing time because we don't have to re-execute parts of the graph that weren't changed while we work on other parts of the graph.

In Graphbook, each node caches its results, and the cache is only cleared if the node's internal Python code changes, its parameter values change, or a user clears it manually. Currently, results are cached in memory, which makes caching fast but very limited, so we have plans to adopt an external solution.
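
Conceptually (again, a sketch rather than Graphbook's actual code), a cache key that invalidates when either the node's code or its parameter values change can be derived like this:

.. code-block:: python

    import hashlib
    import inspect
    import json

    _cache: dict[str, list] = {}  # in-memory, mirroring Graphbook's current approach

    def cache_key(step_cls: type, params: dict) -> str:
        # Hash the node's source code together with its parameter values so that
        # cached outputs are reused only while both stay unchanged.
        source = inspect.getsource(step_cls)
        payload = source + json.dumps(params, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()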

7. Open Source
===============
Most importantly, we need such a solution to be open source while offering a way to manage our own deployments because we are working with proprietary data. We need to stay away from commercial platforms or services that force us to trust in their business practices to keep our data private and secure.

Graphbook is completely free and open source and can be deployed anywhere. We offer the source on GitHub and maintain PyPI packages and container builds.

.. toctree::
:caption: Graphbook
:maxdepth: 2
:hidden:

installing
getting_started
concepts
guides
reference
examples
contributing