Updates to 2024 RIKEN tutorial for DYAD #29

Merged
merged 35 commits into from
Apr 12, 2024
Commits
9db7e92
Updates Dockerfile.spawn and requirements.txt to handle the newest ve…
ilumsden Apr 4, 2024
a4c730b
Moves copy/move of tutorial materials into the root user block
ilumsden Apr 4, 2024
4adc40d
Adds intro, tutorial code cells, and extra dependencies to DYAD porti…
ilumsden Apr 5, 2024
41bb218
Adds cleanup for the data generation directory in DYAD portion of tut…
ilumsden Apr 5, 2024
c4e3f8c
Current progress and debugging
ilumsden Apr 10, 2024
223afdc
DLIO version of the Docker file.
hariharan-devarajan Apr 10, 2024
fe854e1
Starts to break the tutorial down into modules
ilumsden Apr 10, 2024
f036fa2
Minor text changes to 02_flux_scheduling.ipynb
ilumsden Apr 10, 2024
560a48e
Adds a description of the scheduling policies used in flux tree example
ilumsden Apr 10, 2024
223861e
Adds an image for flux tree
ilumsden Apr 10, 2024
72d56e4
Adds attribution to images
ilumsden Apr 10, 2024
26f6834
Adds module labels throughout tutorial
ilumsden Apr 10, 2024
89980ed
Adds Flux logo to other notebooks
ilumsden Apr 10, 2024
eaf4b1a
Makes a couple of fixes for DLIO use case
ilumsden Apr 11, 2024
50d7cd6
Last few bugfixes in DYAD notebook
ilumsden Apr 11, 2024
0ab14fc
Minor changes to Flux scheduling notebook to correct job waiting
ilumsden Apr 11, 2024
13cc28c
Moves the YouTube video in the intro to a code cell so it will work i…
ilumsden Apr 11, 2024
b1a3143
Adds DYAD to LD_LIBRARY_PATH before launching DLIO
ilumsden Apr 11, 2024
07033e1
Renames dyad_dlio.ipynb to 04_dyad_dlio.ipynb
ilumsden Apr 11, 2024
03f62e6
Updates Dockerfile.spawn to get the DLIO use case working
ilumsden Apr 11, 2024
9af1117
Adds a tutorial-specific copy of the DYAD Torch data loader
ilumsden Apr 11, 2024
9d1fecd
Minor bugfixes after moving DLIO extensions into the repo
ilumsden Apr 11, 2024
1613907
Simplifies the DYAD Torch data loader
ilumsden Apr 11, 2024
a874218
Adds reference to LC table and flux proxy optional section
ilumsden Apr 11, 2024
5428b9a
Adds a step to docker-builds to remove unneeded stuff from runner
ilumsden Apr 11, 2024
58ed703
Small consistency tweaks
ilumsden Apr 11, 2024
fc7e37b
Adds module 3
ilumsden Apr 11, 2024
f84cd2c
Small summary change
ilumsden Apr 11, 2024
96897bb
Editing and revisions
ilumsden Apr 12, 2024
0d01733
Finishes the DYAD notebook
ilumsden Apr 12, 2024
f24bbed
Adds the supplement and updates the conclusions
ilumsden Apr 12, 2024
14108ca
Tweaks a figure width
ilumsden Apr 12, 2024
55e27e2
Removes dead changes for testing DLIO
ilumsden Apr 12, 2024
7051f86
Tries uncommenting the rm on apt lists
ilumsden Apr 12, 2024
0cb3fb0
Removes the old DYAD notebook from previous years since it's been sup…
ilumsden Apr 12, 2024
1 change: 1 addition & 0 deletions 2024-RIKEN-AWS/JupyterNotebook/requirements.txt
@@ -339,3 +339,4 @@ websocket-client==1.6.1
# via jupyter-server
# Used for the DYAD notebook
dlio_benchmark @ git+https://github.com/argonne-lcf/dlio_benchmark.git
Pygments
345 changes: 345 additions & 0 deletions 2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/dyad_dlio.ipynb
@@ -0,0 +1,345 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dd3e912b-3428-4bc7-88bd-97686406b75a",
"metadata": {
"tags": []
},
"source": [
"# Welcome to the DYAD component of the Flux tutorial\n"
]
},
{
"cell_type": "markdown",
"id": "4db0555f",
"metadata": {},
"source": [
"> What is DYAD? 🤔️\n",
"\n",
"DYAD is a locality-aware, write-once, read-many file cache that runs on top of local NVMe and other burst buffer-style technologies (e.g., El Capitan Rabbit nodes). It is designed to accelerate large, distributed workloads, such as distributed Deep Learning (DL) training and scientific computing workflows, on HPC systems. Unlike similar tools (e.g., DataSpaces and UnifyFS), which tend to optimize for write performance, DYAD aims to provide good write **and read** performance. To optimize read performance, DYAD uses a locality-aware \"Hierarchical Data Locator,\" which prioritizes node-local metadata and data retrieval to minimize network communication. When moving data from another node, DYAD also uses a streaming RPC-over-RDMA protocol, which relies on preallocated buffers and connection caching to maximize network bandwidth. This process is shown in the figure below:\n",
"\n",
"![DYAD Reading Process](img/dyad_design.png)\n",
"\n",
"DYAD uses several services provided by Flux (key-value store, remote procedure call, broker modules) to orchestrate data movement between nodes. It also uses UCX to move data."
]
},
{
"cell_type": "markdown",
"id": "badb9753",
"metadata": {},
"source": [
"> I'm ready! How do I do this tutorial? 😁️\n",
"\n",
"The process for running this tutorial is the same as for `flux.ipynb`. To step through the examples in this notebook, \n",
"you need to execute cells. To run a cell, press Shift+Enter on your keyboard. If you prefer, you can also paste \n",
"the shell commands in the JupyterLab terminal and execute them there."
]
},
{
"cell_type": "markdown",
"id": "c0a9e0f9",
"metadata": {},
"source": [
"# Accelerating Distributed Deep Learning (DL) Training with DYAD\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "be8da082",
"metadata": {},
"source": [
"## Show code"
]
},
{
"cell_type": "markdown",
"id": "4f018bbc",
"metadata": {},
"source": [
"[data loader](../dlio_extensions/dyad_torch_data_loader.py)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c92da400",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import inspect\n",
"from pygments import highlight\n",
"from pygments.lexers import PythonLexer\n",
"from pygments.formatters import HtmlFormatter\n",
"from IPython.display import display, HTML\n",
"\n",
"# sys.path entries must be directories, so add the folder that contains the module\n",
"sys.path.insert(0, os.path.abspath(\"../dlio_extensions\"))\n",
"\n",
"from dyad_torch_data_loader import DYADTorchDataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27e463c0",
"metadata": {},
"outputs": [],
"source": [
"display(HTML(highlight(inspect.getsource(DYADTorchDataset.worker_init), PythonLexer(), HtmlFormatter(full=True))))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab755b0a",
"metadata": {},
"outputs": [],
"source": [
"display(HTML(highlight(inspect.getsource(DYADTorchDataset.__getitem__), PythonLexer(), HtmlFormatter(full=True))))"
]
},
{
"cell_type": "markdown",
"id": "fefd9ae3",
"metadata": {},
"source": [
"## Configure DLIO and DYAD"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92881a8f",
"metadata": {},
"outputs": [],
"source": [
"kvs_namespace = \"dyad\"\n",
"initial_data_directory = \"/tmp/dlio_data\"\n",
"managed_directory = \"/tmp/dyad_data\"\n",
"workers_per_node = 8"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68be24ed",
"metadata": {},
"outputs": [],
"source": [
"dyad_install_prefix = \"/usr\"\n",
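"num_nodes = !flux hostlist -c\n",
"num_nodes = int(num_nodes[0])  # `!cmd` returns a list of output lines; keep the count as an int\n",
"dlio_extensions_dir = os.path.expandvars(\"$HOME/flux-tutorial-2024/dlio_extensions\")\n",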
"dtl_mode = \"UCX\"\n",
"workload = \"dyad_unet3d_small\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ce527f2",
"metadata": {},
"outputs": [],
"source": [
"env_lines = [\n",
" f\"DYAD_KVS_NAMESPACE={kvs_namespace}\\n\",\n",
" f\"DYAD_DTL_MODE={dtl_mode}\\n\",\n",
" f\"DYAD_PATH={managed_directory}\\n\",\n",
" f\"PYTHONPATH={dlio_extensions_dir}:$PYTHONPATH\\n\",\n",
" \"DLIO_PROFILER_ENABLE=0\\n\",\n",
" \"DLIO_PROFILER_INC_METADATA=1\\n\",\n",
" \"DLIO_PROFILER_LOG_LEVEL=ERROR\\n\",\n",
" \"DLIO_PROFILER_BIND_SIGNALS=0\\n\",\n",
" \"HDF5_USE_FILE_LOCKING=0\\n\",\n",
"]\n",
"with open(\"dlio_env.txt\", \"w\") as f:\n",
" for el in env_lines:\n",
" f.write(el)"
]
},
{
"cell_type": "markdown",
"id": "398e110f",
"metadata": {},
"source": [
"## Create Flux KVS Namespace and start DYAD service"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf132600",
"metadata": {},
"outputs": [],
"source": [
"!flux kvs namespace create {kvs_namespace}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3220ef03",
"metadata": {},
"outputs": [],
"source": [
"!flux exec -r all flux module load {dyad_install_prefix}/lib/dyad.so --mode={dtl_mode} {managed_directory}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4750013c",
"metadata": {},
"outputs": [],
"source": [
"!flux module list"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3322e350",
"metadata": {},
"outputs": [],
"source": [
"!flux kvs namespace list"
]
},
{
"cell_type": "markdown",
"id": "c0dfe655",
"metadata": {},
"source": [
"## Generate Data for Unet3D"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dd03ec1",
"metadata": {},
"outputs": [],
"source": [
"!flux run -N {num_nodes} --tasks-per-node=1 mkdir -p {managed_directory} \n",
"!flux run -N {num_nodes} --tasks-per-node=1 rm -r {managed_directory}/* "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5d30e",
"metadata": {},
"outputs": [],
"source": [
"!flux run -N {num_nodes} -o cpu-affinity=off --tasks-per-node={workers_per_node} --env-file=dlio_env.txt \\\n",
" dlio_benchmark --config-dir={dlio_extensions_dir}/configs workload={workload} \\\n",
" ++workload.dataset.data_folder={initial_data_directory} ++workload.workflow.generate_data=True \\\n",
" ++workload.workflow.train=False"
]
},
{
"cell_type": "markdown",
"id": "3f14ffdd",
"metadata": {},
"source": [
"## Run \"training\" through DLIO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3437a068",
"metadata": {},
"outputs": [],
"source": [
"!flux run -N {num_nodes} -o cpu-affinity=on --tasks-per-node={workers_per_node} --env-file=dlio_env.txt \\\n",
" dlio_benchmark --config-dir={dlio_extensions_dir}/configs workload={workload} \\\n",
" ++workload.dataset.data_folder={initial_data_directory} ++workload.workflow.generate_data=False \\\n",
" ++workload.workflow.train=True"
]
},
{
"cell_type": "markdown",
"id": "573ce232",
"metadata": {},
"source": [
"## Shut down the DYAD service and clean up"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "755251df",
"metadata": {},
"outputs": [],
"source": [
"!flux kvs namespace remove {kvs_namespace}\n",
"!flux exec -r all flux module remove dyad"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bf50c8e",
"metadata": {},
"outputs": [],
"source": [
"!flux module list"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e50c926e",
"metadata": {},
"outputs": [],
"source": [
"!flux kvs namespace list"
]
},
{
"cell_type": "markdown",
"id": "81d7d87f-1e09-42c8-b165-8902551f6847",
"metadata": {},
"source": [
"# This concludes the notebook tutorial for DYAD.\n",
"\n",
"If you are interested in learning more about DYAD, check out our [ReadTheDocs page](https://dyad.readthedocs.io/en/latest/), our [GitHub repository](https://github.com/flux-framework/dyad), and our published/presented works:\n",
"* [eScience 2022 Short Paper](https://dyad.readthedocs.io/en/latest/_downloads/27090817b034a89b76e5538e148fea9e/ShortPaper_2022_eScience_LLNL.pdf)\n",
"* [SC 2023 ACM Student Research Competition Extended Abstract](https://github.com/flux-framework/dyad/blob/main/docs/_static/ExtendedAbstract_2023_SC_ACM_SRC_DYAD.pdf)\n",
"* [IPDPS 2024 HiCOMB Workshop Paper](https://github.com/flux-framework/dyad/blob/main/docs/_static/Paper_2024_IPDPS_HiCOMB_DYAD.pdf)\n",
"\n",
"If you are interested in working with us, please reach out to Jae-Seung Yeom ([email protected]), Hariharan Devarajan ([email protected]), or Ian Lumsden ([email protected])."
]
},
{
"cell_type": "markdown",
"id": "d16426a9",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
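The notebook above writes its DLIO/DYAD settings to `dlio_env.txt` as one `KEY=VALUE` rule per line and passes that file to `flux run --env-file`. The following is a minimal sketch of the same idea, not the notebook's code: the helper names `write_env_file` and `read_env_file` are hypothetical, and the parser is a plain illustration that does not reproduce how Flux itself interprets the file. The values are the ones set in the notebook's configuration cells.

```python
import os
import tempfile


def write_env_file(path, variables):
    """Write one KEY=VALUE rule per line, in insertion order (hypothetical helper)."""
    with open(path, "w") as f:
        for key, value in variables.items():
            f.write(f"{key}={value}\n")


def read_env_file(path):
    """Parse KEY=VALUE lines back into a dict, splitting on the first '=' only."""
    parsed = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            parsed[key] = value
    return parsed


# Values copied from the notebook's configuration cells.
env = {
    "DYAD_KVS_NAMESPACE": "dyad",
    "DYAD_DTL_MODE": "UCX",
    "DYAD_PATH": "/tmp/dyad_data",
}

with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "dlio_env.txt")
    write_env_file(env_path, env)
    round_trip = read_env_file(env_path)
```

Reading the file back before launching a job is a cheap way to confirm that every variable holds a plain string rather than the repr of a Python list.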
@@ -1623,7 +1623,7 @@
"source": [
"![https://flux-framework.org/flux-operator/_static/images/flux-operator.png](https://flux-framework.org/flux-operator/_static/images/flux-operator.png)\n",
"\n",
"> See you next year! 👋️😎️"
"<!-- >> See you next year! 👋️😎️ -->"
Member

This is hidden now?

Collaborator Author

Yeah, just a minor thing, but since I'm not sure if we'll be giving this tutorial at JLESC again next year, I didn't think it made sense to keep for this version of the tutorial.

Collaborator Author

Actually, @vsoch, I just realized that this change was in flux.ipynb. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from flux.ipynb should already be in those notebooks, so do you think it's fine just to delete flux.ipynb?

Collaborator Author

I also just realized that dyad.ipynb is still here. I definitely want to delete that notebook. I'll make a commit for that in a sec

Member

> Actually, @vsoch, I just realized that this change was in flux.ipynb. Part of the plan for this tutorial was to break that down into sections (indicated by the notebooks starting with numbers). All the content from flux.ipynb should already be in those notebooks, so do you think it's fine just to delete flux.ipynb?

I haven't tested it out, so I can't say, but I would rely on your judgment - if you can confidently say nothing is lost it's OK to delete. I'll be running it locally (building now) and I can try to check too, but absence of something is often hard to detect!

Collaborator Author

That's a good point.

For me, the biggest thing is to not have it seen by the attendees of the tutorial because I feel like that will get confusing. I could always remove it in Dockerfile.spawn. That way we still have the file on GitHub, but it won't be there to confuse the attendees.

Collaborator Author

What do you think, @vsoch?

]
}
],