Commit
Merge pull request #12 from DragonHPC/version-update-0.8
merged release 0.8 code
mendygral authored Mar 19, 2024
2 parents 8f17cad + a97e05b commit d8d4301
Showing 168 changed files with 9,704 additions and 1,068 deletions.
2 changes: 1 addition & 1 deletion .devcontainer/constraints.txt
@@ -1,5 +1,5 @@
alabaster==0.7.12
-attrs==22.1.0
+attrs==23.1.0
Babel==2.11.0
black==22.10.0
breathe==4.34.0
1 change: 1 addition & 0 deletions .devcontainer/requirements.txt
@@ -17,3 +17,4 @@ sphinx-copybutton
vacuum
wheel
jupyter
+parsl
2 changes: 2 additions & 0 deletions doc/.gitignore
@@ -5,6 +5,8 @@ ref/client/dragon*.rst
ref/inf/dragon*.rst
ref/native/Python/dragon*.rst
ref/mpbridge/dragon*.rst
+ref/data/dragon*.rst
+ref/workflows/dragon*.rst
ref/native/Python/dragon*.rst
*.svg
internal/services/transport_agent/tcp/*
45 changes: 23 additions & 22 deletions doc/cbook/ai-in-the-loop.rst
@@ -1,35 +1,36 @@
AI-in-the-loop Workflow
+++++++++++++++++++++++++++++++++++++++++++++++++++

-This is an example of how Dragon can be used to execute an AI-in-the-loop workflow.
-Inspiration for this demo comes from the NERSC-10 Workflow Archetypes White Paper.
-This workflow most closely resembles the workflow scenario given as part of archetype four.
-
-In this example we use a small model implemented in PyTorch to compute an approximation to :math:`\sin(x)`.
-In parallel to doing the inference with the model, we launch `sim-cheap` on four MPI ranks.
-This MPI job computes the Taylor approximation to :math:`\sin(x)` and compares this with the output of the model.
-If the difference is less than 0.05 we consider the model's approximation to be sufficiently accurate and print out the result with the exact result.
-If the difference is larger than 0.05 we consider this a failure and re-train the model on a new set of data.
-
-To generate this data we launch `sim-expensive`.
-This MPI job is launched on eight ranks-per-node and each rank generates 32 data points of the form :math:`(x, \sin(x))` where :math:`x \in X \tilde U(-\pi, \pi)`.
-This data is aggregated into a PyTorch tensor and then used to train the model.
-We then re-evaluate the re-trained model and decide if we need to re-train again or if the estimate is sufficiently accurate.
+This is an example of how Dragon can be used to execute an AI-in-the-loop workflow.
+Inspiration for this demo comes from the NERSC-10 Workflow Archetypes White Paper.
+This workflow most closely resembles the workflow scenario given as part of archetype four.
+
+In this example we use a small model implemented in PyTorch to compute an approximation to :math:`\sin(x)`.
+In parallel with the model inference, we launch `sim-cheap` on four MPI ranks.
+This MPI job computes the Taylor approximation to :math:`\sin(x)` and compares it with the output of the model.
+If the difference is less than 0.05, we consider the model's approximation sufficiently accurate and print its result alongside the exact result.
+If the difference is larger than 0.05, we consider this a failure and re-train the model on a new set of data.
+
+To generate this data we launch `sim-expensive`.
+This MPI job is launched with eight ranks-per-node, and each rank generates 32 data points of the form :math:`(x, \sin(x))` where :math:`x \in U(-\pi, \pi)`.
+This data is aggregated into a PyTorch tensor and used to train the model.
+We then re-evaluate the re-trained model and decide whether to re-train again or accept the estimate as sufficiently accurate.
+We continue this loop until we've had five successes.

-Figure 1 presents the structure of this main loop. It shows when each MPI application is launched and what portions are executed in parallel.
+:numref:`ai-in-the-loop` presents the structure of this main loop. It shows when each MPI application is launched and which portions execute in parallel.

-.. figure:: images/ai-in-the-loop-workflow.jpg
-   :scale: 30%
+.. figure:: images/ai-in-the-loop-workflow.jpg
+   :scale: 100%
+   :name: ai-in-the-loop

-   **Figure 1: Example AI-in-the-loop workflow**
+   **Example AI-in-the-loop workflow**


This example consists of the following python files:

* `ai-in-the-loop.py` - This is the main file. It contains functions for launching both MPI executables and parsing their results; it also imports functions defined in `model.py` and coordinates model inference and training with the MPI jobs.

* `model.py` - This file defines the model and provides functions for model training and inference.

Below, we present the main python code (`ai-in-the-loop.py`) which acts as the coordinator of the workflow.
The code of the other files can be found in the release package, inside the `examples/workflows/ai-in-the-loop` directory.
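
The full listing is collapsed in this diff view. For orientation, the control flow reduces to roughly the following sketch; the helper names are hypothetical stand-ins for the release example's model calls and Dragon-launched MPI jobs.

.. code-block:: python

   # Rough sketch of the main loop only. `model_infer`, `launch_cheap_sim`,
   # and `retrain` are hypothetical placeholders, not the release example's API.
   import math
   import random

   def model_infer(x):
       return x - x**3 / 6.0      # placeholder for the PyTorch model's estimate

   def launch_cheap_sim(x):
       return math.sin(x)         # placeholder for the 4-rank `sim-cheap` MPI job

   def retrain():
       pass                       # placeholder: `sim-expensive` generates fresh
                                  # (x, sin(x)) data and the model is re-trained

   successes = 0
   while successes < 5:
       x = random.uniform(-math.pi, math.pi)
       approx, exact = model_infer(x), launch_cheap_sim(x)
       if abs(approx - exact) < 0.05:
           successes += 1
           print(f"model: {approx:.4f}  exact: {exact:.4f}")
       else:
           retrain()              # failure: re-train, then try again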
@@ -259,7 +260,7 @@ The code of the other files can be found in the release package, inside `example
Installation
============

After installing dragon, the only other dependency is PyTorch. The PyTorch version and corresponding pip command can be found at https://pytorch.org/get-started/locally/.

```
> pip install torch torchvision torchaudio
@@ -282,7 +283,7 @@ Example Output when run on 16 nodes with 8 MPI ranks-per-node used to generate d
> make
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -c -o sim-cheap.o sim-cheap.c
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib sim-cheap.o -o sim-cheap -lm -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -lmpich
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -c -o sim-expensive.o
gcc -g -pedantic -Wall -I /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/include -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib sim-expensive.o -o sim-expensive -lm -L /opt/cray/pe/mpich/8.1.26/ofi/gnu/9.1/lib -lmpich
> salloc --nodes=16 --exclusive
> dragon ai-in-the-loop.py
38 changes: 3 additions & 35 deletions doc/cbook/basic_pandarallel_demo.rst
@@ -1,8 +1,8 @@
Basic Pandarallel Demonstration for Single Node Environment
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This Jupyter benchmark is a simple use case for the pandarallel `parallel_apply` call.
It can be run with `dragon` and base multiprocessing to compare performance on your machine.

The program demonstrates how to use `parallel_apply`, the multiprocessing version of pandas `apply`, on a pandas dataframe with random input.

@@ -12,36 +12,4 @@ The code demonstrates the following key concepts working with Dragon:
* How to use pandarallel and pandas with Dragon and base multiprocessing
* How pandarallel handles various dtypes

-.. code-block:: python
-   :linenos:
-   :caption: **basic_pandarallel_demo.ipynb: pandarallel's `parallel_apply` on a dataframe with random input**
-
-   import dragon
-   import multiprocessing
-   import cloudpickle
-   import numpy as np
-   import pandas as pd
-   import pandarallel; pandarallel.__version__
-   multiprocessing.set_start_method("dragon")
-   pandarallel.core.dill = cloudpickle
-   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
-   pandarallel.pandarallel.initialize(progress_bar=True)
-   num_rows = 10
-   df = pd.DataFrame(
-       {
-           "seqnum": np.arange(42, (42 + num_rows), dtype=int),
-           #"metric_A": np.random.rand(num_rows),
-           #"metric_B": np.random.rand(num_rows),
-           "metric_C": np.random.rand(num_rows),
-           "alt_seq": np.random.randint(low=42, high=(42 + num_rows), size=(num_rows,)),
-           "label": np.array(list("ATCG"))[np.random.randint(0, 4, num_rows)],
-       },
-   )
-   cutoff = 0.5  # assumed: defined elsewhere in the notebook
-   df['highlow_C'] = df['metric_C'].parallel_apply(lambda x: x < cutoff)
+.. literalinclude:: ../../examples/jupyter/doc_ref/basic_pandarallel_demo.py
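
The core pattern the notebook relies on fits in a few lines. This is a minimal sketch condensed from the code above, assuming a working Dragon and pandarallel installation; the `cutoff` value is arbitrary.

.. code-block:: python

   import dragon                      # must be imported before multiprocessing
   import multiprocessing
   import cloudpickle
   import numpy as np
   import pandas as pd
   import pandarallel

   multiprocessing.set_start_method("dragon")
   pandarallel.core.dill = cloudpickle            # swap dill for cloudpickle
   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
   pandarallel.pandarallel.initialize(progress_bar=True)

   df = pd.DataFrame({"metric_C": np.random.rand(10)})
   cutoff = 0.5                                   # arbitrary threshold
   df["highlow_C"] = df["metric_C"].parallel_apply(lambda x: x < cutoff)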
49 changes: 11 additions & 38 deletions doc/cbook/bioinfo_alignment_pandarallel_demo.rst
@@ -19,43 +19,12 @@ The code demonstrates the following key concepts working with Dragon:
* How to utilize pandarallel in a multi-node environment
* How to utilize k-means clustering on features such as alignment, E value, and percentage coverage

-.. code-block:: python
-   :linenos:
-   :caption: **bioinformatics_alignment_pandarallel_demo.ipynb: A bioinformatics benchmark for aligning nucleotide sequences and amino acid sequences**
-
-   import dragon
-   import multiprocessing
-   import cloudpickle
-   import os
-   os.environ['OPENBLAS_NUM_THREADS'] = '1'
-   import numpy as np
-   import pandas as pd
-   import Bio
-   from Bio import SeqIO, Entrez
-   import pyalign
-   import time
-   import matplotlib.pyplot as plt
-   from sklearn.cluster import KMeans
-   import seaborn as sns
-   import pandarallel; pandarallel.__version__
-   multiprocessing.set_start_method("dragon")
-   pandarallel.core.dill = cloudpickle
-   pandarallel.core.CONTEXT = multiprocessing.get_context("dragon")
-   pandarallel.pandarallel.initialize(progress_bar=True)
-   # nucl_df, alignment_algorithm, and endo_nucl_seq are defined earlier in the notebook
-   start = time.monotonic()
-   nucl_df['PyAlign Alignment Score'] = nucl_df['Sequence'].parallel_apply(lambda seq2: alignment_algorithm(endo_nucl_seq, seq2, gap=0))
-   stop = time.monotonic()
-   functions, bar_num, tot_time = ['PyAlign Alignment Score'], [128], [stop-start]
+The following notebook was used for the single-node comparison:

+.. literalinclude:: ../../examples/jupyter/doc_ref/bioinformatics_alignment_pandarallel_demo.py

For the single-node run, both base multiprocessing and Dragon are compared. The runs utilized a single node with 2 AMD EPYC 7742 64-Core Processors with 128 cores.
Dragon employs a number of optimizations over base multiprocessing; the Dragon start method outperforms the base multiprocessing spawn start method on the same hardware.

The timing for the base multiprocessing runtime is:

@@ -102,10 +71,14 @@ The timing for the single-node Dragon runtime is:
-
- 27.174203

For the multi-node Dragon run, 2 Apollo nodes were used. Each Apollo node has 1x AMD Rome CPU with 4x AMD MI100 GPUs and 128 cores.
The multi-node use case scales with the total number of CPUs reported by the allocation. With more nodes, workers, and CPUs available, Dragon extends
multiprocessing's stock capabilities and demonstrates additional improvement to measured execution time.
Base multiprocessing does not support multi-node workloads.

+The following notebook was used for the multi-node comparison:

+.. literalinclude:: ../../examples/jupyter/doc_ref/bioinformatics_alignment_pandarallel_multinode_demo.py

The timing for the multi-node Dragon runtime is:

83 changes: 83 additions & 0 deletions doc/cbook/dict_torch_dataset.rst
@@ -0,0 +1,83 @@
PyTorch Dataset Usage with Dragon Distributed Dictionary
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This example shows how a PyTorch dataset can use a Dragon distributed dictionary to store the data.
In principle, the distributed dictionary could be shared among other processes that might interact with the training data between training iterations.
The program must be run with GPUs.

The code demonstrates how the following key concepts work with Dragon:

* How to utilize Dragon and the PyTorch dataloader and neural network model for training on GPUs
* How to use the distributed Dragon dictionary with multiprocessing queues

.. literalinclude:: ../../examples/dragon_ai/dict_torch_dataset.py
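
The shape of the pattern is easy to sketch. The following is not the release code but a minimal illustration: a `torch.utils.data.Dataset` whose backing store is any dict-like object, which in this example would be the Dragon distributed dictionary.

.. code-block:: python

   # Minimal sketch, not the release code: any dict-like store works here,
   # including a Dragon distributed dictionary shared with other processes.
   import torch
   from torch.utils.data import Dataset

   class DictDataset(Dataset):
       def __init__(self, store):
           self.store = store               # e.g. a Dragon distributed dict
           self.keys = list(store.keys())   # snapshot of keys for indexing

       def __len__(self):
           return len(self.keys)

       def __getitem__(self, idx):
           image, label = self.store[self.keys[idx]]
           return torch.as_tensor(image), label

A `DataLoader` can then draw batches from the dictionary exactly as it would from any in-memory dataset.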

Installation
============

After installing dragon, the only other dependency is PyTorch. The PyTorch version and corresponding pip command can be found at https://pytorch.org/get-started/locally/.

```
> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Description of the system used
==============================

For this example, an HPE Cray EX was used. Each node has AMD EPYC 7763 64-core CPUs and 4x Nvidia A100 GPUs.

How to run
==========

Example Output when run on 2 nodes with 2 MNIST workers, 1 device per node, 2 epochs, CUDA training, 4 dragon dict managers, and dragon dict memory.
-------------------------------------------------------------------------------------------------------------------------------------------

.. code-block:: console
   :linenos:

   > salloc --nodes=2 -p allgriz --exclusive -t 1:00:00
   > dragon dict_torch_dataset.py --mnist-workers 4 --devices-per-node 1 --epochs 2
   Number of nodes: 2
   Number of MNIST workers: 2
   Number of dragon dict managers: 4
   100.0%
   100.0%
   100.0%
   100.0%
   Rank 0 Train Epoch: 1 [0/60000 (0%)] Loss: 2.316082
   Rank 1 Train Epoch: 1 [0/60000 (0%)] Loss: 2.313832
   Rank 0 Train Epoch: 1 [6400/60000 (11%)] Loss: 0.268168
   Rank 1 Train Epoch: 1 [6400/60000 (11%)] Loss: 0.436355
   Rank 0 Train Epoch: 1 [12800/60000 (21%)] Loss: 0.190972
   Rank 1 Train Epoch: 1 [12800/60000 (21%)] Loss: 0.205474
   Rank 0 Train Epoch: 1 [19200/60000 (32%)] Loss: 0.187326
   Rank 1 Train Epoch: 1 [19200/60000 (32%)] Loss: 0.568415
   Rank 0 Train Epoch: 1 [25600/60000 (43%)] Loss: 0.093499
   Rank 1 Train Epoch: 1 [25600/60000 (43%)] Loss: 0.058430
   Rank 0 Train Epoch: 1 [32000/60000 (53%)] Loss: 0.060121
   Rank 1 Train Epoch: 1 [32000/60000 (53%)] Loss: 0.149605
   Rank 0 Train Epoch: 1 [38400/60000 (64%)] Loss: 0.156384
   Rank 1 Train Epoch: 1 [38400/60000 (64%)] Loss: 0.119814
   Rank 0 Train Epoch: 1 [44800/60000 (75%)] Loss: 0.082197
   Rank 1 Train Epoch: 1 [44800/60000 (75%)] Loss: 0.096987
   Rank 0 Train Epoch: 1 [51200/60000 (85%)] Loss: 0.053689
   Rank 1 Train Epoch: 1 [51200/60000 (85%)] Loss: 0.101078
   Rank 0 Train Epoch: 1 [57600/60000 (96%)] Loss: 0.031515
   Rank 1 Train Epoch: 1 [57600/60000 (96%)] Loss: 0.090198
   Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw/train-images-idx3-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/train-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw/train-labels-idx1-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/train-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw/t10k-images-idx3-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./torch-data-dict/data/MNIST/raw
   Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
   Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw/t10k-labels-idx1-ubyte.gz
   Extracting ./torch-data-dict/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./torch-data-dict/data/MNIST/raw
18 changes: 10 additions & 8 deletions doc/cbook/distr-inf-telemetry.rst
@@ -183,34 +183,36 @@ which we update every second until the end event is set. Note line 17 where we s
Examples of Input and Output
============================

-Figure 1 provides an example of an input and the response the user receives from the chatbot.

+:numref:`single-prompt-response` provides an example of an input and the response the user receives from the chatbot.

.. figure:: images/llm-grafana-single-prompt-response.jpg
   :scale: 60%
+   :name: single-prompt-response

-   **Figure 1: Input prompt and response with IDs for the prompter, inference worker, and response worker**
+   **Input prompt and response with IDs for the prompter, inference worker, and response worker**



To simulate many different users interacting with a chatbot, we loop over a list of fifteen prompts seven times, giving a total of 105 prompts for the four inference workers
-to respond to. The input loop and prompts are shown in Figure 2. A sample telemetry output as displayed in Grafana after all these prompts are processed is
-shown in Figure 3. Note how the utilization is nearly equal among the GPUs with all starting and ending at the same time. The spikes in utilization prior to
+to respond to. The input loop and prompts are shown in :numref:`loop-over-prompts`. A sample telemetry output as displayed in Grafana after all these prompts are processed is
+shown in :numref:`node-telemetry`. Note how the utilization is nearly equal among the GPUs, with all starting and ending at the same time. The spikes in utilization prior to
the running of the many prompts are from the models being loaded onto the GPUs at the start-up of the inference workers and the worker that responded to the prompt
-in Figure 1.
+in :numref:`single-prompt-response`.
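
Since the loop itself is shown only as an image, here is a hypothetical sketch of its shape; the queue name and prompt text are illustrative, not the example's code.

.. code-block:: python

   # Hypothetical sketch: 15 prompts x 7 passes = 105 prompts fed to the
   # inference workers; `prompt_queue` stands in for the example's work queue.
   import multiprocessing as mp

   prompts = [f"prompt {i}" for i in range(15)]   # fifteen illustrative prompts
   prompt_queue = mp.Queue()

   for _ in range(7):
       for prompt in prompts:
           prompt_queue.put(prompt)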

.. figure:: images/llm-grafana-many-prompts.jpg
   :scale: 50%
+   :name: loop-over-prompts

-   **Figure 2: Loop over list of prompts to simulate many users**
+   **Loop over list of prompts to simulate many users**




.. figure:: images/llm-grafana-telem-data.jpg
   :scale: 60%
+   :name: node-telemetry

-   **Figure 3: Node telemetry data that is visualized using Grafana GUI and highlights the load balanced nature of this example**
+   **Node telemetry data, visualized in the Grafana GUI, highlighting the load-balanced nature of this example**



7 changes: 4 additions & 3 deletions doc/cbook/dragon_dict.rst
@@ -11,8 +11,9 @@ Architecture of Dragon Dictionary
.. figure:: images/dragon_dict_architecture.png
   :align: center
   :scale: 30%
+   :name: high-level-arch

-   **Figure 1: High-level architecture of a Dragon Dictionary**
+   **High-level architecture of a Dragon Dictionary**

From Python code, a user instantiates a dragon dictionary, specifying the number of back-end managers. During bring-up of a dragon dictionary,
a pool of manager processes is started along with a collection of dragon channels used for communication between clients and managers. Each of the managers
@@ -95,9 +96,9 @@ The dictionary is spawned from across 1 node to 64 nodes with each manager worke
with each key of constant size of 30 bytes in the dictionary. The results clearly demonstrate the advantage of the distributed dictionary, with increased
aggregate rate of operations as the dictionary managers are spawned across an increasing number of nodes.
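
A bring-up sketch of the client-side pattern follows. The constructor arguments (managers per node, node count, total managed memory) follow the 0.8-era examples, but treat the import path and exact signature as assumptions to verify against the release docs.

.. code-block:: python

   import dragon
   import multiprocessing as mp
   from dragon.data.distdictionary.dragon_dict import DragonDict  # assumed 0.8 path

   if __name__ == "__main__":
       mp.set_start_method("dragon")
       total_mem = 1024 * 1024 * 1024            # 1 GB across all managers
       dd = DragonDict(2, 1, total_mem)          # 2 managers/node, 1 node (assumed order)
       dd["key"] = "value"                       # ops route to a manager over channels
       assert dd["key"] == "value"
       dd.close()                                # shut down managers (per release docs)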


.. figure:: images/dragon_dict_results.png
   :align: center
   :scale: 25%
+   :name: multinode-results

-   **Figure 2: Results on a multi-node setup**
+   **Results on a multi-node setup**