Add tutorial for Pygrain dataset debugging tool
PiperOrigin-RevId: 726581897
Grain Team authored and copybara-github committed Feb 21, 2025
1 parent 42039a1 commit 078b265
Showing 4 changed files with 350 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -97,4 +97,5 @@
'tutorials/dataset_advanced_tutorial.ipynb',
'tutorials/dataset_basic_tutorial.ipynb',
'tutorials/data_loader_tutorial.ipynb',
'tutorials/dataset_debugging_tutorial.ipynb',
]
1 change: 1 addition & 0 deletions docs/index.md
@@ -65,6 +65,7 @@ data_loader/transformations
tutorials/data_loader_tutorial
tutorials/dataset_basic_tutorial
tutorials/dataset_advanced_tutorial
tutorials/dataset_debugging_tutorial
```

<!-- Automatically generated documentation from docstrings -->
237 changes: 237 additions & 0 deletions docs/tutorials/dataset_debugging_tutorial.ipynb
@@ -0,0 +1,237 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "OHoxgqr6sRKE"
},
"source": [
"# Performance & Debugging tool\n",
"Grain offers two configurable modes that can be set to gain deeper insights into pipeline execution and identify potential issues.\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xw_-jT1r6zNM"
},
"outputs": [],
"source": [
"!pip install grain"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YLaRRlCPsRKE"
},
"source": [
"## Visualization mode\n",
"To get an overview of your dataset pipeline's structure and a clear understanding of how data flows through it, enable visualization mode. This logs a visual representation of the pipeline, making it easy to identify the transformation stages and their relationships. To enable visualization mode, set the `_DATASET_VISUALIZATION_OUTPUT_DIR` config option, either with the flag `--grain_py_dataset_visualization_output_dir=\"\"` or by calling `grain.config.update(\"py_dataset_visualization_output_dir\", \"\")`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4y89Wx7PsRKE"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"scrollable": true,
"text": [
"Grain Dataset graph:\n",
"\n",
"RangeMapDataset(start=0, stop=20, step=1)\n",
" ││\n",
" ││ \n",
" ││\n",
" ╲╱\n",
"\"<class 'int'>[]\"\n",
"\n",
" ││\n",
" ││ WithOptionsMapDataset\n",
" ││\n",
" ╲╱\n",
"\"<class 'int'>[]\"\n",
"\n",
" ││\n",
" ││ ShuffleMapDataset\n",
" ││\n",
" ╲╱\n",
"\"<class 'int'>[]\"\n",
"\n",
" ││\n",
" ││ BatchMapDataset(batch_size=2, drop_remainder=False)\n",
" ││\n",
" ╲╱\n",
"'int64[2]'\n",
"\n",
" ││\n",
" ││ MapMapDataset(transform=<lambda> @ <ipython-input-1-930f8fd1bf7d>:9)\n",
" ││\n",
" ╲╱\n",
"'int64[2]'\n",
"\n",
" ││\n",
" ││ PrefetchDatasetIterator(read_options=ReadOptions(num_threads=16, prefetch_buffer_size=500), allow_nones=False)\n",
" ││\n",
" ╲╱\n",
"'int64[2]'\n",
"\n"
]
}
],
"source": [
"import grain.python as grain\n",
"\n",
"grain.config.update(\"py_dataset_visualization_output_dir\", \"\")\n",
"ds = (\n",
"    grain.MapDataset.range(20)\n",
"    .seed(seed=42)\n",
"    .shuffle()\n",
"    .batch(batch_size=2)\n",
"    .map(lambda x: x)\n",
"    .to_iter_dataset()\n",
")\n",
"it = iter(ds)\n",
"\n",
"# Visualization graph is constructed while iterating through the pipeline\n",
"for _ in range(10):\n",
"    next(it)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_3h-u2I1i7wv"
},
"source": [
"## Debug mode\n",
"To troubleshoot performance issues in your dataset pipeline, enable debug mode. This logs a real-time execution summary of the pipeline at one-minute intervals. The summary provides detailed information on each transformation stage, such as processing time, number of elements processed, and other details that help identify the slower stages in the pipeline. To enable debug mode, set the `_DEBUG_MODE` config option, either with the flag `--grain_py_debug_mode=true` or by calling `grain.config.update(\"py_debug_mode\", True)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bN45Z58E3jGS"
},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Define a dummy slow preprocessing function\n",
"def _dummy_slow_fn(x):\n",
"    time.sleep(10)\n",
"    return x"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 897
},
"id": "bN45Z58E3jGS",
"outputId": "f3d640a8-1eae-414f-e6eb-e7c02c9a91df"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"scrollable": true,
"text": [
"Grain Dataset Execution Summary:\n",
"\n",
"NOTE: Before analyzing the `MapDataset` nodes, ensure that the `total_processing_time` of the `PrefetchDatasetIterator` node indicates it is a bottleneck. The `MapDataset` nodes are executed in multiple threads and thus, should not be compared to the `total_processing_time` of `DatasetIterator` nodes.\n",
"\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| id | name | inputs | percent wait time | total processing time | min processing time | max processing time | avg processing time | num produced elements |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 6 | RangeMapDataset(start=0, stop= | [] | 0.00% | 86.92us | 1.00us | 53.91us | 4.35us | 20 |\n",
"| | 20, step=1) | | | | | | | |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 5 | WithOptionsMapDataset | [6] | 0.00% | N/A | N/A | N/A | N/A | N/A |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 4 | ShuffleMapDataset | [5] | 0.00% | 15.95ms | 42.40us | 2.28ms | 797.35us | 20 |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 3 | BatchMapDataset(batch_size=2, | [4] | 0.00% | 803.14us | 47.04us | 290.24us | 80.31us | 10 |\n",
"| | drop_remainder=False) | | | | | | | |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 2 | MapMapDataset(transform=_dummy | [3] | 16.68% | 100.08s | 10.00s | 10.01s | 10.01s | 10 |\n",
"| | _slow_fn @ <ipython-input-2-23 | | | | | | | |\n",
"| | 02a47a813f>:4) | | | | | | | |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 1 | PrefetchDatasetIterator(read_o | [2] | N/A | 10.02s | 12.40us | 10.02s | 1.67s | 6 |\n",
"| | ptions=ReadOptions(num_threads | | | | | | | |\n",
"| | =16, prefetch_buffer_size=500) | | | | | | | |\n",
"| | , allow_nones=False) | | | | | | | |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"| 0 | MapDatasetIterator(transform=_ | [1] | 83.32% | 50.05s | 10.01s | 10.01s | 10.01s | 5 |\n",
"| | dummy_slow_fn @ <ipython-input | | | | | | | |\n",
"| | -2-2302a47a813f>:4) | | | | | | | |\n",
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
"\n"
]
}
],
"source": [
"import time\n",
"grain.config.update(\"py_debug_mode\", True)\n",
"\n",
"ds = (\n",
"    grain.MapDataset.range(20)\n",
"    .seed(seed=42)\n",
"    .shuffle()\n",
"    .batch(batch_size=2)\n",
"    .map(_dummy_slow_fn)\n",
"    .to_iter_dataset()\n",
"    .map(_dummy_slow_fn)\n",
")\n",
"it = iter(ds)\n",
"\n",
"for _ in range(10):\n",
"    next(it)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eSu9SOP8_x6A"
},
"source": [
"In the execution summary above, the `MapDatasetIterator` node (id:0) has the highest `percent wait time` (83.32%), which makes it the slowest stage of the pipeline.\n",
"\n",
"Note that, judging by `total_processing_time` alone, `MapMapDataset` (id:2) might appear to be the slowest stage. However, nodes 2 through 6 run in multiple threads, so their `total_processing_time` should not be compared directly with the `total_processing_time` of iterator nodes such as id:0."
]
}
],
"metadata": {
"colab": {
"last_runtime": {
"build_target": "",
"kind": "local"
},
"provenance": []
},
"jupytext": {
"formats": "ipynb,md:myst",
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
111 changes: 111 additions & 0 deletions docs/tutorials/dataset_debugging_tutorial.md
@@ -0,0 +1,111 @@
---
jupytext:
  formats: ipynb,md:myst
  main_language: python
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.1
kernelspec:
  display_name: Python 3
  name: python3
---

+++ {"id": "OHoxgqr6sRKE"}

# Performance & Debugging tool

Grain offers two configurable modes that can be set to gain deeper insights into
pipeline execution and identify potential issues.
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb)

``` {code-cell}
:id: xw_-jT1r6zNM
!pip install grain
```

+++ {"id": "YLaRRlCPsRKE"}

## Visualization mode

To get an overview of your dataset pipeline's structure and a clear
understanding of how data flows through it, enable visualization mode. This logs
a visual representation of the pipeline, making it easy to identify the
transformation stages and their relationships. To enable visualization mode, set
the flag `--grain_py_dataset_visualization_output_dir=""` or call
`grain.config.update("py_dataset_visualization_output_dir", "")`.

``` {code-cell}
:id: 4y89Wx7PsRKE
import grain.python as grain

grain.config.update("py_dataset_visualization_output_dir", "")
ds = (
    grain.MapDataset.range(20)
    .seed(seed=42)
    .shuffle()
    .batch(batch_size=2)
    .map(lambda x: x)
    .to_iter_dataset()
)
it = iter(ds)

# Visualization graph is constructed on the first iteration over the dataset
for _ in range(10):
    next(it)
```

+++ {"id": "_3h-u2I1i7wv"}

## Debug mode

To troubleshoot performance issues in your dataset pipeline, enable debug mode.
This logs a real-time execution summary of the pipeline at one-minute intervals.
The summary provides detailed information on each transformation stage, such as
processing time, number of elements processed, and other details that help
identify the slower stages in the pipeline. To enable debug mode, set the flag
`--grain_py_debug_mode=true` or call
`grain.config.update("py_debug_mode", True)`.

``` {code-cell}
:id: bN45Z58E3jGS
import time

# Define a dummy slow preprocessing function
def _dummy_slow_fn(x):
    time.sleep(10)
    return x
```

``` {code-cell}
---
colab:
height: 897
id: bN45Z58E3jGS
outputId: f3d640a8-1eae-414f-e6eb-e7c02c9a91df
---
import time
grain.config.update("py_debug_mode", True)

ds = (
    grain.MapDataset.range(20)
    .seed(seed=42)
    .shuffle()
    .batch(batch_size=2)
    .map(_dummy_slow_fn)
    .to_iter_dataset()
    .map(_dummy_slow_fn)
)
it = iter(ds)

for _ in range(10):
    next(it)
```

+++ {"id": "eSu9SOP8_x6A"}

In the execution summary above, the `MapDatasetIterator` node (id:0) has the
highest `percent wait time` (83.32%), which makes it the slowest stage of the
pipeline.
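
As a rough cross-check of those percentages, the sketch below takes the two
iterator-stage totals reported in the example run's execution summary (50.05s
for `MapDatasetIterator` and 10.02s for `PrefetchDatasetIterator`) and divides
each by their sum. The resulting shares line up with the 83.32% and 16.68%
values in the `percent wait time` column. This is only an observation about
these particular numbers, not a description of how Grain computes the column;
the dictionary and variable names are illustrative.

```python
# Iterator-stage totals copied from the example run's execution summary (seconds).
# The names below are illustrative and are not part of the Grain API.
iterator_totals_s = {
    "MapDatasetIterator (id:0)": 50.05,
    "PrefetchDatasetIterator (id:1)": 10.02,
}

# Share of the combined iterator time taken by each iterator stage.
combined_s = sum(iterator_totals_s.values())
shares = {
    name: 100 * total / combined_s for name, total in iterator_totals_s.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.2f}% of combined iterator time")
```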

Note that, judging by `total_processing_time` alone, `MapMapDataset` (id:2) might
appear to be the slowest stage. However, nodes 2 through 6 run in multiple
threads, so their `total_processing_time` should not be compared directly with
the `total_processing_time` of iterator nodes such as id:0.
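
To see why time summed across threaded stages cannot be read as wall-clock
time, here is a minimal standard-library sketch that does not use Grain at all.
`timed_slow_fn`, `run`, the delay, and the thread count are hypothetical choices
for illustration. Each call records its own duration, and the sum of those
durations is compared with the elapsed time of the whole run.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_slow_fn(x, delay_s=0.05):
    """Sleeps briefly and returns the input with the time this call took."""
    start = time.perf_counter()
    time.sleep(delay_s)
    return x, time.perf_counter() - start


def run(num_elements=8, num_threads=4):
    """Processes elements in a thread pool; returns (wall time, summed time)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(timed_slow_fn, range(num_elements)))
    wall_s = time.perf_counter() - wall_start
    # Summing per-call durations counts overlapping work once per thread.
    summed_s = sum(elapsed for _, elapsed in results)
    return wall_s, summed_s


wall_s, summed_s = run()
print(f"wall-clock: {wall_s:.2f}s, summed per-call time: {summed_s:.2f}s")
```

Because the calls overlap, the summed figure is larger than the wall-clock
figure. The same effect is why a threaded node can report a
`total_processing_time` far above that of the iterator consuming it.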
