Commit
Merge pull request #12 from DragonHPC/version-update-0.8
merged release 0.8 code
mendygral authored Mar 19, 2024
2 parents 8f17cad + a97e05b commit d8d4301
Showing 168 changed files with 9,704 additions and 1,068 deletions.
2 changes: 1 addition & 1 deletion .devcontainer/constraints.txt
@@ -1,5 +1,5 @@
alabaster==0.7.12
-attrs==22.1.0
+attrs==23.1.0
Babel==2.11.0
black==22.10.0
breathe==4.34.0
1 change: 1 addition & 0 deletions .devcontainer/requirements.txt
@@ -17,3 +17,4 @@ sphinx-copybutton
vacuum
wheel
jupyter
+parsl
2 changes: 2 additions & 0 deletions doc/.gitignore
@@ -5,6 +5,8 @@ ref/client/dragon*.rst
ref/inf/dragon*.rst
ref/native/Python/dragon*.rst
ref/mpbridge/dragon*.rst
+ref/data/dragon*.rst
+ref/workflows/dragon*.rst
ref/native/Python/dragon*.rst
*.svg
internal/services/transport_agent/tcp/*
45 changes: 23 additions & 22 deletions doc/cbook/ai-in-the-loop.rst
@@ -1,35 +1,36 @@
AI-in-the-loop Workflow
+++++++++++++++++++++++++++++++++++++++++++++++++++

-This is an example of how Dragon can be used to execute an AI-in-the-loop workflow.
-Inspiration for this demo comes from the NERSC-10 Workflow Archetypes White Paper.
-This workflow most closely resembles the workflow scenario given as part of archetype four.
-
-In this example we use a small model implemented in PyTorch to compute an approximation to :math:`\sin(x)`.
-In parallel to doing the inference with the model, we launch `sim-cheap` on four MPI ranks.
-This MPI job computes the Taylor approximation to :math:`\sin(x)` and compares this with the output of the model.
-If the difference is less than 0.05 we consider the model's approximation to be sufficiently accurate and print out the result with the exact result.
-If the difference is larger than 0.05 we consider this a failure and re-train the model on a new set of data.
-
-To generate this data we launch `sim-expensive`.
-This MPI job is launched on eight ranks-per-node and each rank generates 32 data points of the form :math:`(x, \sin(x))` where :math:`x \in X \tilde U(-\pi, \pi)`.
-This data is aggregated into a PyTorch tensor and then used to train the model.
-We then re-evaluate the re-trained model and decide if we need to re-train again or if the estimate is sufficiently accurate.
+This is an example of how Dragon can be used to execute an AI-in-the-loop workflow.
+Inspiration for this demo comes from the NERSC-10 Workflow Archetypes White Paper.
+This workflow most closely resembles the workflow scenario given as part of archetype four.
+
+In this example we use a small model implemented in PyTorch to compute an approximation to :math:`\sin(x)`.
+In parallel with the model inference, we launch `sim-cheap` on four MPI ranks.
+This MPI job computes the Taylor approximation to :math:`\sin(x)` and compares it with the output of the model.
+If the difference is less than 0.05, we consider the model's approximation sufficiently accurate and print its result alongside the exact result.
+If the difference is larger than 0.05, we consider this a failure and re-train the model on a new set of data.
+
+To generate this data we launch `sim-expensive`.
+This MPI job is launched with eight ranks-per-node, and each rank generates 32 data points of the form :math:`(x, \sin(x))` where :math:`x \in U(-\pi, \pi)`.
+This data is aggregated into a PyTorch tensor and used to train the model.
+We then re-evaluate the re-trained model and decide whether to re-train again or accept the estimate as sufficiently accurate.
+We continue this loop until we've had five successes.

-Figure 1 presents the structure of this main loop. It shows when each MPI application is launched and what portions are executed in parallel.
+:numref:`ai-in-the-loop` presents the structure of this main loop. It shows when each MPI application is launched and which portions execute in parallel.

-.. figure:: images/ai-in-the-loop-workflow.jpg
-   :scale: 30%
+.. figure:: images/ai-in-the-loop-workflow.jpg
+   :scale: 100%
+   :name: ai-in-the-loop

-   **Figure 1: Example AI-in-the-loop workflow**
+   **Example AI-in-the-loop workflow**


This example consists of the following python files:

* `ai-in-the-loop.py` - This is the main file. It contains functions for launching both MPI executables and parsing their results; it also imports functions defined in `model.py` and coordinates model inference and training with the MPI jobs.

* `model.py` - This file defines the model and provides functions for model training and inference.

Below, we present the main python code (`ai-in-the-loop.py`) which acts as the coordinator of the workflow.
The code of the other files can be found in the release package, inside the `examples/workflows/ai-in-the-loop` directory.
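
The full listing is collapsed in this diff view. For orientation, the control flow reduces to roughly the following sketch; the helper names are hypothetical stand-ins for the release example's model calls and Dragon-launched MPI jobs.

.. code-block:: python

   # Rough sketch of the main loop only. `model_infer`, `launch_cheap_sim`,
   # and `retrain` are hypothetical placeholders, not the release example's API.
   import math
   import random

   def model_infer(x):
       return x - x**3 / 6.0      # placeholder for the PyTorch model's estimate

   def launch_cheap_sim(x):
       return math.sin(x)         # placeholder for the 4-rank `sim-cheap` MPI job

   def retrain():
       pass                       # placeholder: `sim-expensive` generates fresh
                                  # (x, sin(x)) data and the model is re-trained

   successes = 0
   while successes < 5:
       x = random.uniform(-math.pi, math.pi)
       approx, exact = model_infer(x), launch_cheap_sim(x)
       if abs(approx - exact) < 0.05:
           successes += 1
           print(f"model: {approx:.4f}  exact: {exact:.4f}")
       else:
           retrain()              # failure: re-train, then try again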
@@ -259,7 +260,7 @@ The code of the other files can be found in the release package, inside `example
Installation
============

After installing dragon, the only other dependency is PyTorch. The PyTorch version and corresponding pip command can be found at https://pytorch.org/get-started/locally/.

```
> pip install torch torchvision torchaudio
@@ -282,7 +283,7 @@ Example Output when run on 16 nodes with 8 MPI ranks-per-node used to generate d
> make
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -c -o sim-cheap.o sim-cheap.c
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib sim-cheap.o -o sim-cheap -lm -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -lmpich
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -c -o sim-expensive.o
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib sim-expensive.o -o sim-expensive -lm -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -lmpich
> salloc --nodes=16 --exclusive
> dragon ai-in-the-loop.py
38 changes: 3 additions & 35 deletions doc/cbook/basic_pandarallel_demo.rst
@@ -1,8 +1,8 @@
Basic Pandarallel Demonstration for Single Node Environment
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This Jupyter benchmark is a simple use case for the pandarallel `parallel_apply` call.
It can be run with `dragon` and base multiprocessing to compare performance on your machine.

The program demonstrates how to use `parallel_apply`, the multiprocessing version of pandas `apply`, on a pandas dataframe with random input.

@@ -12,36 +12,4 @@ The code demonstrates the following key concepts working with Dragon:
* How to use pandarallel and pandas with Dragon and base multiprocessing
* How pandarallel handles various dtypes

-.. code-block:: python
-   :linenos:
-   :caption: **basic_pandarallel_demo.ipynb: pandarallel's `parallel_apply` on a dataframe with random input**
-
-   import dragon
-   import multiprocessing
-   import cloudpickle
-   import numpy as np
-   import pandas as pd
-   import pandarallel; pandarallel.__version__
-   multiprocessing.set_start_method("dragon")
-   pandarallel.core.dill = cloudpickle
-   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
-   pandarallel.pandarallel.initialize(progress_bar=True)
-   num_rows = 10
-   df = pd.DataFrame(
-       {
-           "seqnum": np.arange(42, (42 + num_rows), dtype=int),
-           #"metric_A": np.random.rand(num_rows),
-           #"metric_B": np.random.rand(num_rows),
-           "metric_C": np.random.rand(num_rows),
-           "alt_seq": np.random.randint(low=42, high=(42 + num_rows), size=(num_rows,)),
-           "label": np.array(list("ATCG"))[np.random.randint(0, 4, num_rows)],
-       },
-   )
-   cutoff = 0.5  # assumed: defined elsewhere in the notebook
-   df['highlow_C'] = df['metric_C'].parallel_apply(lambda x: x < cutoff)
+.. literalinclude:: ../../examples/jupyter/doc_ref/basic_pandarallel_demo.py
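
The core pattern the notebook relies on fits in a few lines. This is a minimal sketch condensed from the code above, assuming a working Dragon and pandarallel installation; the `cutoff` value is arbitrary.

.. code-block:: python

   import dragon                      # must be imported before multiprocessing
   import multiprocessing
   import cloudpickle
   import numpy as np
   import pandas as pd
   import pandarallel

   multiprocessing.set_start_method("dragon")
   pandarallel.core.dill = cloudpickle            # swap dill for cloudpickle
   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
   pandarallel.pandarallel.initialize(progress_bar=True)

   df = pd.DataFrame({"metric_C": np.random.rand(10)})
   cutoff = 0.5                                   # arbitrary threshold
   df["highlow_C"] = df["metric_C"].parallel_apply(lambda x: x < cutoff)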
49 changes: 11 additions & 38 deletions doc/cbook/bioinfo_alignment_pandarallel_demo.rst
@@ -19,43 +19,12 @@ The code demonstrates the following key concepts working with Dragon:
* How to utilize pandarallel in a multi-node environment
* How to utilize k-means clustering on features such as alignment, E value, and percentage coverage

-.. code-block:: python
-   :linenos:
-   :caption: **bioinformatics_alignment_pandarallel_demo.ipynb: A bioinformatics benchmark for aligning nucleotide sequences and amino acid sequences**
-
-   import dragon
-   import multiprocessing
-   import cloudpickle
-   import os
-   os.environ['OPENBLAS_NUM_THREADS'] = '1'
-   import numpy as np
-   import pandas as pd
-   import Bio
-   from Bio import SeqIO, Entrez
-   import pyalign
-   import time
-   import matplotlib.pyplot as plt
-   from sklearn.cluster import KMeans
-   import seaborn as sns
-   import pandarallel; pandarallel.__version__
-   multiprocessing.set_start_method("dragon")
-   pandarallel.core.dill = cloudpickle
-   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
-   pandarallel.pandarallel.initialize(progress_bar=True)
-   # nucl_df, alignment_algorithm, and endo_nucl_seq are defined earlier in the notebook
-   start = time.monotonic()
-   nucl_df['PyAlign Alignment Score'] = nucl_df['Sequence'].parallel_apply(lambda seq2: alignment_algorithm(endo_nucl_seq, seq2, gap=0))
-   stop = time.monotonic()
-   functions, bar_num, tot_time = ['PyAlign Alignment Score'], [128], [stop-start]
+The following notebook was used for the single-node comparison:

+.. literalinclude:: ../../examples/jupyter/doc_ref/bioinformatics_alignment_pandarallel_demo.py

For the single-node run, both base multiprocessing and Dragon are compared. The runs utilized a single node with 2 AMD EPYC 7742 64-Core Processors with 128 cores.
Dragon employs a number of optimizations over base multiprocessing; the Dragon start method outperforms the base multiprocessing spawn start method on the same hardware.

The timing for the base multiprocessing runtime is:

@@ -102,10 +71,14 @@ The timing for the single-node Dragon runtime is:
-
- 27.174203

For the multi-node Dragon run, 2 Apollo nodes were used. Each Apollo node has 1x AMD Rome CPU with 4x AMD MI100 GPUs and 128 cores.
The multi-node use case scales with the total number of CPUs reported by the allocation. With more nodes, workers, and CPUs available, Dragon extends
multiprocessing's stock capabilities and demonstrates additional improvement to measured execution time.
Base multiprocessing does not support multi-node workloads.

+The following notebook was used for the multi-node comparison:

+.. literalinclude:: ../../examples/jupyter/doc_ref/bioinformatics_alignment_pandarallel_multinode_demo.py

The timing for the multi-node Dragon runtime is:

83 changes: 83 additions & 0 deletions doc/cbook/dict_torch_dataset.rst
@@ -0,0 +1,83 @@
PyTorch Dataset Usage with Dragon Distributed Dictionary
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example shows how a PyTorch dataset can use a Dragon distributed dictionary to store the data.
In principle, the distributed dictionary could be shared among other processes that might interact with the training data between training iterations.
The program must be run with GPUs.

The code demonstrates how the following key concepts work with Dragon:

* How to utilize Dragon and the PyTorch dataloader and neural network model for training on GPUs
* How to use the distributed Dragon dictionary with multiprocessing queues

.. literalinclude:: ../../examples/dragon_ai/dict_torch_dataset.py
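
The shape of the pattern is easy to sketch. The following is not the release code but a minimal illustration: a `torch.utils.data.Dataset` whose backing store is any dict-like object, which in this example would be the Dragon distributed dictionary.

.. code-block:: python

   # Minimal sketch, not the release code: any dict-like store works here,
   # including a Dragon distributed dictionary shared with other processes.
   import torch
   from torch.utils.data import Dataset

   class DictDataset(Dataset):
       def __init__(self, store):
           self.store = store               # e.g. a Dragon distributed dict
           self.keys = list(store.keys())   # snapshot of keys for indexing

       def __len__(self):
           return len(self.keys)

       def __getitem__(self, idx):
           image, label = self.store[self.keys[idx]]
           return torch.as_tensor(image), label

A `DataLoader` can then draw batches from the dictionary exactly as it would from any in-memory dataset.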

Installation
============

After installing dragon, the only other dependency is PyTorch. The PyTorch version and corresponding pip command can be found at https://pytorch.org/get-started/locally/.

```
> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Description of the system used
==============================

For this example, an HPE Cray EX was used. Each node has AMD EPYC 7763 64-core CPUs and 4x Nvidia A100 GPUs.

How to run
==========

Example Output when run on 2 nodes with 2 MNIST workers, 1 device per node, 2 epochs, CUDA training, 4 dragon dict managers, and dragon dict memory.
-------------------------------------------------------------------------------------------------------------------------------------------

.. code-block:: console
   :linenos:

   > salloc --nodes=2 -p allgriz --exclusive -t 1:00:00
   > dragon dict_torch_dataset.py --mnist-workers 4 --devices-per-node 1 --epochs 2
   Number of nodes: 2
   Number of MNIST workers: 2
   Number of dragon dict managers: 4
   100.0%
   100.0%
   100.0%
   100.0%
   Rank 0 Train Epoch: 1 [0/60000 (0%)] Loss: 2.316082
   Rank 1 Train Epoch: 1 [0/60000 (0%)] Loss: 2.313832
   Rank 0 Train Epoch: 1 [6400/60000 (11%)] Loss: 0.268168
   Rank 1 Train Epoch: 1 [6400/60000 (11%)] Loss: 0.436355
   Rank 0 Train Epoch: 1 [12800/60000 (21%)] Loss: 0.190972
   Rank 1 Train Epoch: 1 [12800/60000 (21%)] Loss: 0.205474
   Rank 0 Train Epoch: 1 [19200/60000 (32%)] Loss: 0.187326
   Rank 1 Train Epoch: 1 [19200/60000 (32%)] Loss: 0.568415
   Rank 0 Train Epoch: 1 [25600/60000 (43%)] Loss: 0.093499
   Rank 1 Train Epoch: 1 [25600/60000 (43%)] Loss: 0.058430
   Rank 0 Train Epoch: 1 [32000/60000 (53%)] Loss: 0.060121
   Rank 1 Train Epoch: 1 [32000/60000 (53%)] Loss: 0.149605
   Rank 0 Train Epoch: 1 [38400/60000 (64%)] Loss: 0.156384
   Rank 1 Train Epoch: 1 [38400/60000 (64%)] Loss: 0.119814
   Rank 0 Train Epoch: 1 [44800/60000 (75%)] Loss: 0.082197
   Rank 1 Train Epoch: 1 [44800/60000 (75%)] Loss: 0.096987
   Rank 0 Train Epoch: 1 [51200/60000 (85%)] Loss: 0.053689
   Rank 1 Train Epoch: 1 [51200/60000 (85%)] Loss: 0.101078
   Rank 0 Train Epoch: 1 [57600/60000 (96%)] Loss: 0.031515
   Rank 1 Train Epoch: 1 [57600/60000 (96%)] Loss: 0.090198
   Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw/train-images-idx3-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/train-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw/train-labels-idx1-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/train-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw/t10k-images-idx3-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw/t10k-labels-idx1-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw
18 changes: 10 additions & 8 deletions doc/cbook/distr-inf-telemetry.rst
@@ -183,34 +183,36 @@ which we update every second until the end event is set. Note line 17 where we s
Examples of Input and Output
============================

-Figure 1 provides an example of an input and the response the user receives from the chatbot.

+:numref:`single-prompt-response` provides an example of an input and the response the user receives from the chatbot.

.. figure:: images/llm-grafana-single-prompt-response.jpg
   :scale: 60%
+   :name: single-prompt-response

-   **Figure 1: Input prompt and response with IDs for the prompter, inference worker, and response worker**
+   **Input prompt and response with IDs for the prompter, inference worker, and response worker**



To simulate many different users interacting with a chatbot, we loop over a list of fifteen prompts seven times, giving a total of 105 prompts for the four inference workers
-to respond to. The input loop and prompts are shown in Figure 2. A sample telemetry output as displayed in Grafana after all these prompts are processed is
-shown in Figure 3. Note how the utilization is nearly equal among the GPUs with all starting and ending at the same time. The spikes in utilization prior to
+to respond to. The input loop and prompts are shown in :numref:`loop-over-prompts`. A sample telemetry output as displayed in Grafana after all these prompts are processed is
+shown in :numref:`node-telemetry`. Note how the utilization is nearly equal among the GPUs, with all starting and ending at the same time. The spikes in utilization prior to
the running of the many prompts are from the models being loaded onto the GPUs at the start-up of the inference workers and the worker that responded to the prompt
-in Figure 1.
+in :numref:`single-prompt-response`.
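
Since the loop itself is shown only as an image, here is a hypothetical sketch of its shape; the queue name and prompt text are illustrative, not the example's code.

.. code-block:: python

   # Hypothetical sketch: 15 prompts x 7 passes = 105 prompts fed to the
   # inference workers; `prompt_queue` stands in for the example's work queue.
   import multiprocessing as mp

   prompts = [f"prompt {i}" for i in range(15)]   # fifteen illustrative prompts
   prompt_queue = mp.Queue()

   for _ in range(7):
       for prompt in prompts:
           prompt_queue.put(prompt)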

.. figure:: images/llm-grafana-many-prompts.jpg
   :scale: 50%
+   :name: loop-over-prompts

-   **Figure 2: Loop over list of prompts to simulate many users**
+   **Loop over list of prompts to simulate many users**




.. figure:: images/llm-grafana-telem-data.jpg
   :scale: 60%
+   :name: node-telemetry

-   **Figure 3: Node telemetry data that is visualized using Grafana GUI and highlights the load balanced nature of this example**
+   **Node telemetry data, visualized in the Grafana GUI, highlighting the load-balanced nature of this example**



7 changes: 4 additions & 3 deletions doc/cbook/dragon_dict.rst
@@ -11,8 +11,9 @@ Architecture of Dragon Dictionary
.. figure:: images/dragon_dict_architecture.png
   :align: center
   :scale: 30%
+   :name: high-level-arch

-   **Figure 1: High-level architecture of a Dragon Dictionary**
+   **High-level architecture of a Dragon Dictionary**

From Python code, a user instantiates a dragon dictionary, specifying the number of back-end managers. During bring-up of a dragon dictionary,
a pool of manager processes is started along with a collection of dragon channels used for communication between clients and managers. Each of the managers
@@ -95,9 +96,9 @@ The dictionary is spawned from across 1 node to 64 nodes with each manager worke
with each key of constant size of 30 bytes in the dictionary. The results clearly demonstrate the advantage of the distributed dictionary, with increased
aggregate rate of operations as the dictionary managers are spawned across an increasing number of nodes.
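
A bring-up sketch of the client-side pattern follows. The constructor arguments (managers per node, node count, total managed memory) follow the 0.8-era examples, but treat the import path and exact signature as assumptions to verify against the release docs.

.. code-block:: python

   import dragon
   import multiprocessing as mp
   from dragon.data.distdictionary.dragon_dict import DragonDict  # assumed 0.8 path

   if __name__ == "__main__":
       mp.set_start_method("dragon")
       total_mem = 1024 * 1024 * 1024            # 1 GB across all managers
       dd = DragonDict(2, 1, total_mem)          # 2 managers/node, 1 node (assumed order)
       dd["key"] = "value"                       # ops route to a manager over channels
       assert dd["key"] == "value"
       dd.close()                                # shut down managers (per release docs)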


.. figure:: images/dragon_dict_results.png
   :align: center
   :scale: 25%
+   :name: multinode-results

-   **Figure 2: Results on a multi-node setup**
+   **Results on a multi-node setup**