Add tutorial for Pygrain dataset debugging tool
PiperOrigin-RevId: 726581897
1 parent 42039a1, commit 078b265
Showing 4 changed files with 350 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,237 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "OHoxgqr6sRKE" | ||
}, | ||
"source": [ | ||
"# Performance & Debugging tool\n", | ||
"Grain offers two configurable modes that can be set to gain deeper insights into pipeline execution and identify potential issues.\n", | ||
"[](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "xw_-jT1r6zNM" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install grain" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "YLaRRlCPsRKE" | ||
}, | ||
"source": [ | ||
"## Visualization mode\n", | ||
"To get an overview of your dataset pipeline structure and clear understanding of how the data flows, enable visualization mode. This will log a visual representation of your pipeline, allowing you to easily identify different transformation stages and their relationships. To enable visualization mode, set the `_DATASET_VISUALIZATION_OUTPUT_DIR` config option using flag `--grain_py_dataset_visualization_output_dir=\"\"` or by calling `grain.config.update(\"py_dataset_visualization_output_dir\", \"\")`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "4y89Wx7PsRKE" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"scrollable": true, | ||
"text": [ | ||
"Grain Dataset graph:\n", | ||
"\n", | ||
"RangeMapDataset(start=0, stop=20, step=1)\n", | ||
" ││\n", | ||
" ││ \n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"\"<class 'int'>[]\"\n", | ||
"\n", | ||
" ││\n", | ||
" ││ WithOptionsMapDataset\n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"\"<class 'int'>[]\"\n", | ||
"\n", | ||
" ││\n", | ||
" ││ ShuffleMapDataset\n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"\"<class 'int'>[]\"\n", | ||
"\n", | ||
" ││\n", | ||
" ││ BatchMapDataset(batch_size=2, drop_remainder=False)\n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"'int64[2]'\n", | ||
"\n", | ||
" ││\n", | ||
" ││ MapMapDataset(transform=<lambda> @ <ipython-input-1-930f8fd1bf7d>:9)\n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"'int64[2]'\n", | ||
"\n", | ||
" ││\n", | ||
" ││ PrefetchDatasetIterator(read_options=ReadOptions(num_threads=16, prefetch_buffer_size=500), allow_nones=False)\n", | ||
" ││\n", | ||
" ╲╱\n", | ||
"'int64[2]'\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import grain.python as grain\n", | ||
"\n", | ||
"grain.config.update(\"py_dataset_visualization_output_dir\", \"\")\n", | ||
"ds = (\n", | ||
" grain.MapDataset.range(20)\n", | ||
" .seed(seed=42)\n", | ||
" .shuffle()\n", | ||
" .batch(batch_size=2)\n", | ||
" .map(lambda x: x)\n", | ||
" .to_iter_dataset()\n", | ||
")\n", | ||
"it = iter(ds)\n", | ||
"\n", | ||
"# Visualization graph is constructed while iterating through the pipeline\n", | ||
"for _ in range(10):\n", | ||
" next(it)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "_3h-u2I1i7wv" | ||
}, | ||
"source": [ | ||
"## Debug mode\n", | ||
"To troubleshoot performance issues in your dataset pipeline, enable debug mode. This will log a real-time execution summary of the pipeline at one-minute intervals. This execution summary provides a detailed information on each transformation stage such as processing time, number of elements processed and other details that helps in identifying the slower stages in the pipeline. To enable debug mode, set the `_DEBUG_MODE` config option using flag `--grain_py_debug_mode=true` or by calling `grain.config.update(\"py_debug_mode\",True)`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "bN45Z58E3jGS" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import time\n", | ||
"\n", | ||
"# Define a dummy slow preprocessing function\n", | ||
"def _dummy_slow_fn(x):\n", | ||
" time.sleep(10)\n", | ||
" return x" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"colab": { | ||
"height": 897 | ||
}, | ||
"id": "bN45Z58E3jGS", | ||
"outputId": "f3d640a8-1eae-414f-e6eb-e7c02c9a91df" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"scrollable": true, | ||
"text": [ | ||
"Grain Dataset Execution Summary:\n", | ||
"\n", | ||
"NOTE: Before analyzing the `MapDataset` nodes, ensure that the `total_processing_time` of the `PrefetchDatasetIterator` node indicates it is a bottleneck. The `MapDataset` nodes are executed in multiple threads and thus, should not be compared to the `total_processing_time` of `DatasetIterator` nodes.\n", | ||
"\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| id | name | inputs | percent wait time | total processing time | min processing time | max processing time | avg processing time | num produced elements |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 6 | RangeMapDataset(start=0, stop= | [] | 0.00% | 86.92us | 1.00us | 53.91us | 4.35us | 20 |\n", | ||
"| | 20, step=1) | | | | | | | |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 5 | WithOptionsMapDataset | [6] | 0.00% | N/A | N/A | N/A | N/A | N/A |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 4 | ShuffleMapDataset | [5] | 0.00% | 15.95ms | 42.40us | 2.28ms | 797.35us | 20 |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 3 | BatchMapDataset(batch_size=2, | [4] | 0.00% | 803.14us | 47.04us | 290.24us | 80.31us | 10 |\n", | ||
"| | drop_remainder=False) | | | | | | | |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 2 | MapMapDataset(transform=_dummy | [3] | 16.68% | 100.08s | 10.00s | 10.01s | 10.01s | 10 |\n", | ||
"| | _slow_fn @ <ipython-input-2-23 | | | | | | | |\n", | ||
"| | 02a47a813f>:4) | | | | | | | |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 1 | PrefetchDatasetIterator(read_o | [2] | N/A | 10.02s | 12.40us | 10.02s | 1.67s | 6 |\n", | ||
"| | ptions=ReadOptions(num_threads | | | | | | | |\n", | ||
"| | =16, prefetch_buffer_size=500) | | | | | | | |\n", | ||
"| | , allow_nones=False) | | | | | | | |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"| 0 | MapDatasetIterator(transform=_ | [1] | 83.32% | 50.05s | 10.01s | 10.01s | 10.01s | 5 |\n", | ||
"| | dummy_slow_fn @ <ipython-input | | | | | | | |\n", | ||
"| | -2-2302a47a813f>:4) | | | | | | | |\n", | ||
"|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import time\n", | ||
"grain.config.update(\"py_debug_mode\", True)\n", | ||
"\n", | ||
"ds = (\n", | ||
" grain.MapDataset.range(20)\n", | ||
" .seed(seed=42)\n", | ||
" .shuffle()\n", | ||
" .batch(batch_size=2)\n", | ||
" .map(_dummy_slow_fn)\n", | ||
" .to_iter_dataset()\n", | ||
" .map(_dummy_slow_fn)\n", | ||
")\n", | ||
"it = iter(ds)\n", | ||
"\n", | ||
"for _ in range(10):\n", | ||
" next(it)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "eSu9SOP8_x6A" | ||
}, | ||
"source": [ | ||
"In the above execution summary, 86% of the time is spent in the `MapDatasetIterator` node and is the slowest stage of the pipeline.\n", | ||
"\n", | ||
"Note that although from the `total_processing_time`, it might appear that `MapMapDataset`(id:2) is the slowest stage, nodes from the id 2 to 6 are executed in multiple threads and hence, the `total_processing_time` of these nodes should be compared to the `total_processing_time` of iterator nodes(id:0)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"last_runtime": { | ||
"build_target": "", | ||
"kind": "local" | ||
}, | ||
"provenance": [] | ||
}, | ||
"jupytext": { | ||
"formats": "ipynb,md:myst", | ||
"main_language": "python" | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
--- | ||
jupytext: | ||
  formats: ipynb,md:myst | ||
  main_language: python | ||
  text_representation: | ||
    extension: .md | ||
    format_name: myst | ||
    format_version: 0.13 | ||
    jupytext_version: 1.16.1 | ||
kernelspec: | ||
  display_name: Python 3 | ||
  name: python3 | ||
--- | ||
|
||
+++ {"id": "OHoxgqr6sRKE"} | ||
|
||
# Performance & Debugging tool | ||
|
||
Grain offers two configurable modes that can be set to gain deeper insights into | ||
pipeline execution and identify potential issues. | ||
[Open in Colab](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb) | ||
|
||
``` {code-cell} | ||
:id: xw_-jT1r6zNM | ||
!pip install grain | ||
``` | ||
|
||
+++ {"id": "YLaRRlCPsRKE"} | ||
|
||
## Visualization mode | ||
|
||
To get an overview of your dataset pipeline's structure and a clear | ||
understanding of how data flows through it, enable visualization mode. This | ||
logs a visual representation of the pipeline, making it easy to identify the | ||
transformation stages and how they relate. To enable visualization mode, set | ||
the flag `--grain_py_dataset_visualization_output_dir=""` or call | ||
`grain.config.update("py_dataset_visualization_output_dir", "")`. | ||
|
||
``` {code-cell} | ||
:id: 4y89Wx7PsRKE | ||
import grain.python as grain | ||
grain.config.update("py_dataset_visualization_output_dir", "") | ||
ds = ( | ||
    grain.MapDataset.range(20) | ||
    .seed(seed=42) | ||
    .shuffle() | ||
    .batch(batch_size=2) | ||
    .map(lambda x: x) | ||
    .to_iter_dataset() | ||
) | ||
it = iter(ds) | ||
# Visualization graph is constructed on the first iteration over the dataset | ||
for _ in range(10): | ||
    next(it) | ||
``` | ||
|
||
+++ {"id": "_3h-u2I1i7wv"} | ||
|
||
## Debug mode | ||
|
||
To troubleshoot performance issues in your dataset pipeline, enable debug mode. | ||
This will log a real-time execution summary of the pipeline at one-minute | ||
intervals. This execution summary provides a detailed information on each | ||
transformation stage such as processing time, number of elements processed and | ||
other details that helps in identifying the slower stages in the pipeline. To | ||
enable debug mode, set the flag `--grain_py_debug_mode=true` or call | ||
`grain.config.update("py_debug_mode",True)` | ||
|
||
``` {code-cell} | ||
:id: bN45Z58E3jGS | ||
import time | ||
# Define a dummy slow preprocessing function | ||
def _dummy_slow_fn(x): | ||
    time.sleep(10) | ||
    return x | ||
``` | ||
|
||
``` {code-cell} | ||
--- | ||
colab: | ||
height: 897 | ||
id: bN45Z58E3jGS | ||
outputId: f3d640a8-1eae-414f-e6eb-e7c02c9a91df | ||
--- | ||
import time | ||
grain.config.update("py_debug_mode", True) | ||
ds = ( | ||
    grain.MapDataset.range(20) | ||
    .seed(seed=42) | ||
    .shuffle() | ||
    .batch(batch_size=2) | ||
    .map(_dummy_slow_fn) | ||
    .to_iter_dataset() | ||
    .map(_dummy_slow_fn) | ||
) | ||
it = iter(ds) | ||
for _ in range(10): | ||
    next(it) | ||
``` | ||
|
||
+++ {"id": "eSu9SOP8_x6A"} | ||
|
||
In the above execution summary, about 83% of the time (the 83.32% in the | ||
`percent wait time` column) is spent in the `MapDatasetIterator` node, making | ||
it the slowest stage of the pipeline. | ||
|
||
Note that although the `total_processing_time` might suggest that | ||
`MapMapDataset` (id: 2) is the slowest stage, the nodes with ids 2 to 6 are | ||
executed in multiple threads, so their `total_processing_time` should not be | ||
compared directly to the `total_processing_time` of iterator nodes (id: 0). | ||
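
The figures in the execution summary can be cross-checked with a little arithmetic. The sketch below copies a few values from the summary table into plain Python and recomputes them. It does not call Grain: the `summary` list, its field names, and the `bottleneck` variable are illustrative assumptions, not part of any Grain API. Following this tutorial's reading of the table, the node with the largest `percent wait time` is treated as the bottleneck.

```python
# Values copied by hand from the execution summary (times in seconds).
# The dictionary layout is hypothetical and only serves this illustration.
summary = [
    {"id": 2, "name": "MapMapDataset", "percent_wait": 16.68,
     "total_s": 100.08, "avg_s": 10.01, "produced": 10},
    {"id": 0, "name": "MapDatasetIterator", "percent_wait": 83.32,
     "total_s": 50.05, "avg_s": 10.01, "produced": 5},
]

# Treat the node with the largest share of wait time as the bottleneck.
bottleneck = max(summary, key=lambda node: node["percent_wait"])
print("bottleneck:", bottleneck["name"])

# Each total is roughly the average per-element time multiplied by the
# number of elements the node produced.
estimates = {}
for node in summary:
    estimates[node["id"]] = node["avg_s"] * node["produced"]
    print(node["name"], "estimated total:", round(estimates[node["id"]], 2),
          "reported total:", node["total_s"])
```

For `MapMapDataset`, the total of about 100 s is accumulated across the prefetch threads, while the `PrefetchDatasetIterator` row reports a total of only about 10 s. That gap is why the totals of the multi-threaded nodes are a poor guide to the slowest stage on their own.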