Add tutorial for Pygrain dataset debugging tool

PiperOrigin-RevId: 726581897
google · Feb 21, 2025 · 078b265
1 parent 42039a1
commit 078b265
Showing 4 changed files with 350 additions and 0 deletions.
diff --git a/docs/conf.py b/docs/conf.py
@@ -97,4 +97,5 @@
     'tutorials/dataset_advanced_tutorial.ipynb',
     'tutorials/dataset_basic_tutorial.ipynb',
     'tutorials/data_loader_tutorial.ipynb',
+    'tutorials/dataset_debugging_tutorial.ipynb',
 ]
diff --git a/docs/index.md b/docs/index.md
@@ -65,6 +65,7 @@ data_loader/transformations
 tutorials/data_loader_tutorial
 tutorials/dataset_basic_tutorial
 tutorials/dataset_advanced_tutorial
+tutorials/dataset_debugging_tutorial
 ```
 
 <!-- Automatically generated documentation from docstrings -->

diff --git a/docs/tutorials/dataset_debugging_tutorial.ipynb b/docs/tutorials/dataset_debugging_tutorial.ipynb
@@ -0,0 +1,237 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "OHoxgqr6sRKE"
+   },
+   "source": [
+    "# Performance & Debugging tool\n",
+    "Grain offers two configurable modes that can be set to gain deeper insights into pipeline execution and identify potential issues.\n",
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "xw_-jT1r6zNM"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install grain"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "YLaRRlCPsRKE"
+   },
+   "source": [
+    "## Visualization mode\n",
+    "To get an overview of your dataset pipeline structure and a clear understanding of how the data flows, enable visualization mode. This logs a visual representation of your pipeline, allowing you to easily identify the different transformation stages and their relationships. To enable visualization mode, set the `_DATASET_VISUALIZATION_OUTPUT_DIR` config option, either with the flag `--grain_py_dataset_visualization_output_dir=\"\"` or by calling `grain.config.update(\"py_dataset_visualization_output_dir\", \"\")`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "4y89Wx7PsRKE"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "scrollable": true,
+     "text": [
+      "Grain Dataset graph:\n",
+      "\n",
+      "RangeMapDataset(start=0, stop=20, step=1)\n",
+      "  ││\n",
+      "  ││  \n",
+      "  ││\n",
+      "  ╲╱\n",
+      "\"<class 'int'>[]\"\n",
+      "\n",
+      "  ││\n",
+      "  ││  WithOptionsMapDataset\n",
+      "  ││\n",
+      "  ╲╱\n",
+      "\"<class 'int'>[]\"\n",
+      "\n",
+      "  ││\n",
+      "  ││  ShuffleMapDataset\n",
+      "  ││\n",
+      "  ╲╱\n",
+      "\"<class 'int'>[]\"\n",
+      "\n",
+      "  ││\n",
+      "  ││  BatchMapDataset(batch_size=2, drop_remainder=False)\n",
+      "  ││\n",
+      "  ╲╱\n",
+      "'int64[2]'\n",
+      "\n",
+      "  ││\n",
+      "  ││  MapMapDataset(transform=<lambda> @ <ipython-input-1-930f8fd1bf7d>:9)\n",
+      "  ││\n",
+      "  ╲╱\n",
+      "'int64[2]'\n",
+      "\n",
+      "  ││\n",
+      "  ││  PrefetchDatasetIterator(read_options=ReadOptions(num_threads=16, prefetch_buffer_size=500), allow_nones=False)\n",
+      "  ││\n",
+      "  ╲╱\n",
+      "'int64[2]'\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "import grain.python as grain\n",
+    "\n",
+    "grain.config.update(\"py_dataset_visualization_output_dir\", \"\")\n",
+    "ds = (\n",
+    "    grain.MapDataset.range(20)\n",
+    "    .seed(seed=42)\n",
+    "    .shuffle()\n",
+    "    .batch(batch_size=2)\n",
+    "    .map(lambda x: x)\n",
+    "    .to_iter_dataset()\n",
+    ")\n",
+    "it = iter(ds)\n",
+    "\n",
+    "# Visualization graph is constructed while iterating through the pipeline\n",
+    "for _ in range(10):\n",
+    "  next(it)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "_3h-u2I1i7wv"
+   },
+   "source": [
+    "## Debug mode\n",
+    "To troubleshoot performance issues in your dataset pipeline, enable debug mode. This logs a real-time execution summary of the pipeline at one-minute intervals. The summary provides detailed information on each transformation stage, such as processing time, number of elements processed, and other details that help identify the slower stages in the pipeline. To enable debug mode, set the `_DEBUG_MODE` config option, either with the flag `--grain_py_debug_mode=true` or by calling `grain.config.update(\"py_debug_mode\", True)`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "bN45Z58E3jGS"
+   },
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "# Define a dummy slow preprocessing function\n",
+    "def _dummy_slow_fn(x):\n",
+    "  time.sleep(10)\n",
+    "  return x"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "height": 897
+    },
+    "id": "bN45Z58E3jGS",
+    "outputId": "f3d640a8-1eae-414f-e6eb-e7c02c9a91df"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "scrollable": true,
+     "text": [
+      "Grain Dataset Execution Summary:\n",
+      "\n",
+      "NOTE: Before analyzing the `MapDataset` nodes, ensure that the `total_processing_time` of the `PrefetchDatasetIterator` node indicates it is a bottleneck. The `MapDataset` nodes are executed in multiple threads and thus, should not be compared to the `total_processing_time` of `DatasetIterator` nodes.\n",
+      "\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| id | name                           | inputs | percent wait time | total processing time | min processing time | max processing time | avg processing time | num produced elements |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 6  | RangeMapDataset(start=0, stop= | []     | 0.00%             | 86.92us               | 1.00us              | 53.91us             | 4.35us              | 20                    |\n",
+      "|    | 20, step=1)                    |        |                   |                       |                     |                     |                     |                       |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 5  | WithOptionsMapDataset          | [6]    | 0.00%             | N/A                   | N/A                 | N/A                 | N/A                 | N/A                   |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 4  | ShuffleMapDataset              | [5]    | 0.00%             | 15.95ms               | 42.40us             | 2.28ms              | 797.35us            | 20                    |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 3  | BatchMapDataset(batch_size=2,  | [4]    | 0.00%             | 803.14us              | 47.04us             | 290.24us            | 80.31us             | 10                    |\n",
+      "|    | drop_remainder=False)          |        |                   |                       |                     |                     |                     |                       |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 2  | MapMapDataset(transform=_dummy | [3]    | 16.68%            | 100.08s               | 10.00s              | 10.01s              | 10.01s              | 10                    |\n",
+      "|    | _slow_fn @ <ipython-input-2-23 |        |                   |                       |                     |                     |                     |                       |\n",
+      "|    | 02a47a813f>:4)                 |        |                   |                       |                     |                     |                     |                       |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 1  | PrefetchDatasetIterator(read_o | [2]    | N/A               | 10.02s                | 12.40us             | 10.02s              | 1.67s               | 6                     |\n",
+      "|    | ptions=ReadOptions(num_threads |        |                   |                       |                     |                     |                     |                       |\n",
+      "|    | =16, prefetch_buffer_size=500) |        |                   |                       |                     |                     |                     |                       |\n",
+      "|    | , allow_nones=False)           |        |                   |                       |                     |                     |                     |                       |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "| 0  | MapDatasetIterator(transform=_ | [1]    | 83.32%            | 50.05s                | 10.01s              | 10.01s              | 10.01s              | 5                     |\n",
+      "|    | dummy_slow_fn @ <ipython-input |        |                   |                       |                     |                     |                     |                       |\n",
+      "|    | -2-2302a47a813f>:4)            |        |                   |                       |                     |                     |                     |                       |\n",
+      "|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "import time\n",
+    "grain.config.update(\"py_debug_mode\", True)\n",
+    "\n",
+    "ds = (\n",
+    "    grain.MapDataset.range(20)\n",
+    "    .seed(seed=42)\n",
+    "    .shuffle()\n",
+    "    .batch(batch_size=2)\n",
+    "    .map(_dummy_slow_fn)\n",
+    "    .to_iter_dataset()\n",
+    "    .map(_dummy_slow_fn)\n",
+    ")\n",
+    "it = iter(ds)\n",
+    "\n",
+    "for _ in range(10):\n",
+    "  next(it)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "eSu9SOP8_x6A"
+   },
+   "source": [
+    "In the above execution summary, about 83% of the time is spent in the `MapDatasetIterator` node, making it the slowest stage of the pipeline.\n",
+    "\n",
+    "Note that although the `total_processing_time` might suggest that `MapMapDataset` (id: 2) is the slowest stage, the nodes with ids 2 to 6 are executed in multiple threads, so their `total_processing_time` should not be compared directly to the `total_processing_time` of iterator nodes (id: 0)."
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "last_runtime": {
+    "build_target": "",
+    "kind": "local"
+   },
+   "provenance": []
+  },
+  "jupytext": {
+   "formats": "ipynb,md:myst",
+   "main_language": "python"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/docs/tutorials/dataset_debugging_tutorial.md b/docs/tutorials/dataset_debugging_tutorial.md
@@ -0,0 +1,111 @@
+---
+jupytext:
+  formats: ipynb,md:myst
+  main_language: python
+  text_representation: {extension: .md, format_name: myst, format_version: 0.13, jupytext_version: 1.16.1}
+kernelspec: {display_name: Python 3, name: python3}
+---
+
++++ {"id": "OHoxgqr6sRKE"}
+
+# Performance & Debugging tool
+
+Grain offers two configurable modes that can be set to gain deeper insights into
+pipeline execution and identify potential issues.
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/grain/blob/main/docs/tutorials/dataset_debugging_tutorial.ipynb)
+
+``` {code-cell}
+:id: xw_-jT1r6zNM
+
+!pip install grain
+```
+
++++ {"id": "YLaRRlCPsRKE"}
+
+## Visualization mode
+
+To get an overview of your dataset pipeline structure and a clear understanding
+of how the data flows, enable visualization mode. This logs a visual
+representation of your pipeline, allowing you to easily identify the different
+transformation stages and their relationships. To enable visualization mode, set
+the flag `--grain_py_dataset_visualization_output_dir=""` or call
+`grain.config.update("py_dataset_visualization_output_dir", "")`.
+
+``` {code-cell}
+:id: 4y89Wx7PsRKE
+
+import grain.python as grain
+
+grain.config.update("py_dataset_visualization_output_dir", "")
+ds = (
+    grain.MapDataset.range(20)
+    .seed(seed=42)
+    .shuffle()
+    .batch(batch_size=2)
+    .map(lambda x: x)
+    .to_iter_dataset()
+)
+it = iter(ds)
+
+# Visualization graph is constructed on the first iteration over the dataset
+for _ in range(10):
+  next(it)
+```
+
++++ {"id": "_3h-u2I1i7wv"}
+
+## Debug mode
+
+To troubleshoot performance issues in your dataset pipeline, enable debug mode.
+This logs a real-time execution summary of the pipeline at one-minute
+intervals. The summary provides detailed information on each transformation
+stage, such as processing time, number of elements processed, and other details
+that help identify the slower stages in the pipeline. To enable debug mode, set
+the flag `--grain_py_debug_mode=true` or call
+`grain.config.update("py_debug_mode", True)`.
+
+``` {code-cell}
+:id: bN45Z58E3jGS
+
+import time
+
+# Define a dummy slow preprocessing function
+def _dummy_slow_fn(x):
+  time.sleep(10)
+  return x
+```
+
+``` {code-cell}
+---
+colab:
+  height: 897
+id: bN45Z58E3jGS
+outputId: f3d640a8-1eae-414f-e6eb-e7c02c9a91df
+---
+import time
+grain.config.update("py_debug_mode", True)
+
+ds = (
+    grain.MapDataset.range(20)
+    .seed(seed=42)
+    .shuffle()
+    .batch(batch_size=2)
+    .map(_dummy_slow_fn)
+    .to_iter_dataset()
+    .map(_dummy_slow_fn)
+)
+it = iter(ds)
+
+for _ in range(10):
+  next(it)
+```
+
++++ {"id": "eSu9SOP8_x6A"}
+
+In the above execution summary, about 83% of the time is spent in the
+`MapDatasetIterator` node, making it the slowest stage of the pipeline.
+
+Note that although the `total_processing_time` might suggest that
+`MapMapDataset` (id: 2) is the slowest stage, the nodes with ids 2 to 6 are
+executed in multiple threads, so their `total_processing_time` should not be
+compared directly to the `total_processing_time` of iterator nodes (id: 0).
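
The tutorial's closing note about multi-threaded nodes can be illustrated without Grain. The sketch below uses only the Python standard library and is not Grain's implementation. The helper names `timed_sleep` and `compare_summed_and_wall_time` are hypothetical, and the short sleep stands in for a slow transformation. It suggests why per-element times summed across worker threads can be much larger than the wall-clock time the consumer actually waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_sleep(seconds):
  """Sleeps for `seconds` and returns the measured duration of the call."""
  start = time.perf_counter()
  time.sleep(seconds)  # Stand-in for a slow per-element transformation.
  return time.perf_counter() - start


def compare_summed_and_wall_time(num_tasks=8, num_threads=4, seconds=0.05):
  """Returns (summed per-task time, wall-clock time) for a threaded run."""
  wall_start = time.perf_counter()
  with ThreadPoolExecutor(max_workers=num_threads) as pool:
    # Each task reports its own duration; tasks overlap across threads.
    durations = list(pool.map(timed_sleep, [seconds] * num_tasks))
  wall_time = time.perf_counter() - wall_start
  return sum(durations), wall_time


summed, wall = compare_summed_and_wall_time()
print(f"summed per-task time: {summed:.2f}s, wall-clock time: {wall:.2f}s")
```

Because the tasks overlap, the summed figure counts time that elapsed concurrently. This is one plausible reading of why a threaded stage can report a total of 100.08s while the single-threaded iterator stage reports 50.05s.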