New examples for the updated documentation #495

Merged
merged 39 commits into from
Nov 20, 2024
Changes from all commits
39 commits
73dc0de
new examples
jan-janssen Nov 12, 2024
b5bf962
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 12, 2024
a31700b
Build notebooks as tests
jan-janssen Nov 12, 2024
362a25a
add executor bit
jan-janssen Nov 12, 2024
8cb4dd7
extend notebook environment
jan-janssen Nov 12, 2024
b2088a0
Update 3-hpc-allocation.ipynb
jan-janssen Nov 12, 2024
a35430e
Add key features
jan-janssen Nov 13, 2024
3ab770e
Merge remote-tracking branch 'origin/main' into examples
jan-janssen Nov 15, 2024
ee2a158
update key arguments
jan-janssen Nov 15, 2024
f688059
Merge remote-tracking branch 'origin/main' into examples
jan-janssen Nov 15, 2024
e8a9987
Work in progress for the readme
jan-janssen Nov 15, 2024
faa2c62
Update readme
jan-janssen Nov 15, 2024
9d688df
add new lines
jan-janssen Nov 15, 2024
936fa62
Merge remote-tracking branch 'origin/main' into examples
jan-janssen Nov 15, 2024
acede13
Change Backend Names
jan-janssen Nov 15, 2024
ee532a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 15, 2024
a1359d4
Update __init__.py
jan-janssen Nov 15, 2024
74bee15
Merge remote-tracking branch 'origin/main' into examples
jan-janssen Nov 19, 2024
294db24
update readme
jan-janssen Nov 19, 2024
8fa5d7b
Update installation
jan-janssen Nov 19, 2024
307fb40
Fix init
jan-janssen Nov 19, 2024
52d9723
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 19, 2024
7a6b068
update local notebook
jan-janssen Nov 19, 2024
59f6b97
Merge remote-tracking branch 'origin/examples' into examples
jan-janssen Nov 19, 2024
948ea0e
Merge remote-tracking branch 'origin/main' into examples
jan-janssen Nov 19, 2024
4432c65
update local example notebook
jan-janssen Nov 19, 2024
e209d8c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 19, 2024
103ff06
Explain jupyter kernel installation
jan-janssen Nov 19, 2024
1481110
Merge remote-tracking branch 'origin/examples' into examples
jan-janssen Nov 19, 2024
bca3c32
copy existing kernel
jan-janssen Nov 19, 2024
b8ab12f
Add HPC submission
jan-janssen Nov 19, 2024
6fabda8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 19, 2024
a6ca0f5
execute HPC notebook once
jan-janssen Nov 20, 2024
21b7099
hpc allocation
jan-janssen Nov 20, 2024
ee386f7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 20, 2024
1632ba1
Update documentation
jan-janssen Nov 20, 2024
99e5126
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 20, 2024
0b239e2
replace HPC submission notebook
jan-janssen Nov 20, 2024
a11e378
Merge remote-tracking branch 'origin/examples' into examples
jan-janssen Nov 20, 2024
11 changes: 11 additions & 0 deletions .ci_support/build_notebooks.sh
@@ -0,0 +1,11 @@
#!/bin/bash
# execute notebooks
i=0;
for notebook in $(ls notebooks/*.ipynb); do
    papermill ${notebook} ${notebook%.*}-out.${notebook##*.} -k python3 || i=$((i+1));
done;
Comment on lines +4 to +6
⚠️ Potential issue

Fix fragile file iteration and improve error logging

The current implementation has several issues:

  1. Using ls for iteration is fragile and will break with spaces in filenames
  2. Failed notebook executions aren't logged
  3. No progress indication during execution

Apply this diff to make the script more robust:

-for notebook in $(ls notebooks/*.ipynb); do
-    papermill ${notebook} ${notebook%.*}-out.${notebook##*.} -k python3 || i=$((i+1));
-done;
+# Create a log directory
+mkdir -p logs
+
+# Use shell globbing instead of ls
+for notebook in notebooks/*.ipynb; do
+    if [ ! -f "$notebook" ]; then
+        continue  # Skip if no notebooks found
+    fi
+    
+    echo "Processing: $notebook"
+    base_name=$(basename "$notebook" .ipynb)
+    output_file="${notebook%.*}-out.${notebook##*.}"
+    log_file="logs/${base_name}.log"
+    
+    if ! papermill "$notebook" "$output_file" -k python3 2>&1 | tee "$log_file"; then
+        echo "Failed to execute: $notebook" >&2
+        i=$((i+1))
+    fi
+done
🧰 Tools
🪛 Shellcheck

[error] 4-4: Iterating over ls output is fragile. Use globs.

(SC2045)


# push error to next level
if [ $i -gt 0 ]; then
    exit 1;
fi;
Comment on lines +8 to +11
🛠️ Refactor suggestion

Enhance error reporting with execution summary

The current error handling doesn't provide enough information about the execution results.

Apply this diff to improve error reporting:

-# push error to next level
-if [ $i -gt 0 ]; then
-    exit 1;
-fi;
+# Print execution summary
+total=$(find notebooks -name "*.ipynb" | wc -l)
+successful=$((total - i))
+
+echo "Notebook Execution Summary:"
+echo "-------------------------"
+echo "Total notebooks: $total"
+echo "Successfully executed: $successful"
+echo "Failed executions: $i"
+
+if [ $i -gt 0 ]; then
+    echo "Error: $i notebook(s) failed execution. Check logs directory for details." >&2
+    exit 1
+fi
+
+echo "All notebooks executed successfully!"

2 changes: 1 addition & 1 deletion .github/workflows/notebooks.yml
@@ -34,4 +34,4 @@ jobs:
timeout-minutes: 5
run: >
flux start
-        papermill notebooks/examples.ipynb examples-out.ipynb -k "python3"
+        .ci_support/build_notebooks.sh
💡 Codebase verification

Increase the workflow timeout.

The current 5-minute timeout may be insufficient for executing all 5 notebooks. Consider increasing the timeout-minutes value to ensure successful completion.

🔗 Analysis chain

Consider increasing the workflow timeout.

The workflow is now executing multiple notebooks through the shell script, but still uses a 5-minute timeout. This might be insufficient when running all notebooks sequentially, especially those involving HPC and resource allocation examples.

Consider:

  1. Increasing the timeout-minutes value
  2. Adding timeouts per notebook in the shell script instead
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Count notebooks and estimate required time
echo "Number of notebooks to be executed:"
fd .ipynb$ notebooks/ | wc -l

Length of output: 535


Script:

#!/bin/bash
# Count the number of Jupyter notebooks
echo "Total number of notebooks:"
fd .ipynb$ notebooks/ | wc -l

Length of output: 98

197 changes: 104 additions & 93 deletions README.md
@@ -3,111 +3,122 @@
[![Coverage Status](https://coveralls.io/repos/github/pyiron/executorlib/badge.svg?branch=main)](https://coveralls.io/github/pyiron/executorlib?branch=main)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/pyiron/executorlib/HEAD?labpath=notebooks%2Fexamples.ipynb)

## Challenges
In high performance computing (HPC) the Python programming language is commonly used as a high-level language to
orchestrate the coupling of scientific applications. Still, the efficient use of highly parallel HPC clusters remains
challenging, primarily in three aspects:

* **Communication**: Distributing Python function calls over hundreds of compute nodes and gathering the results on a
  shared file system is technically possible, but highly inefficient. A socket-based communication approach is
  preferable.
* **Resource Management**: Assigning Python functions to GPUs or executing Python functions on multiple CPUs using the
  message passing interface (MPI) requires major modifications to the Python workflow.
* **Integration**: Existing workflow libraries implement a secondary job management layer on the Python level rather
  than leveraging the existing infrastructure provided by the job scheduler of the HPC.

### executorlib is ...
In a given HPC allocation the `executorlib` library addresses these challenges by extending the Executor interface
of the standard Python library to support resource assignment in the HPC context. Computing resources can either be
assigned on a per function call basis or as a block allocation on a per Executor basis. The `executorlib` library
is built on top of the [flux-framework](https://flux-framework.org) to enable fine-grained resource assignment. In
addition, the [Simple Linux Utility for Resource Management (SLURM)](https://slurm.schedmd.com) is supported as an
alternative queuing system, and for workstation installations `executorlib` can be installed without a job scheduler.

### executorlib is not ...
The `executorlib` library is not designed to request an allocation from the job scheduler of an HPC. Instead, within a
given allocation from the job scheduler, the `executorlib` library can be employed to distribute a series of Python
function calls over the available computing resources to achieve maximum computing resource utilization.

Up-scale python functions for high performance computing (HPC) with executorlib.

## Key Features
* **Up-scale your Python functions beyond a single computer** - executorlib extends the [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
  from the Python standard library and combines it with job schedulers for high performance computing (HPC), including
  the [Simple Linux Utility for Resource Management (SLURM)](https://slurm.schedmd.com) and [flux](http://flux-framework.org).
  With this combination executorlib allows users to distribute their Python functions over multiple compute nodes.
* **Parallelize your Python program one function at a time** - executorlib allows users to assign dedicated computing
  resources like CPU cores, threads or GPUs to one Python function call at a time. So you can accelerate your Python
  code function by function.
* **Permanent caching of intermediate results to accelerate rapid prototyping** - To accelerate the development of
  machine learning pipelines and simulation workflows, executorlib provides optional caching of intermediate results for
  iterative development in interactive environments like Jupyter notebooks; see the sketch after this list.
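
The caching feature can be sketched as follows; the `cache_directory` parameter name is taken from the executorlib
documentation and should be treated as an assumption here:
```python
from executorlib import Executor


# assumption: cache_directory enables persistent caching of results, so
# resubmitting the same function with the same arguments reuses the cache
with Executor(backend="local", cache_directory="./cache") as exe:
    fs = exe.submit(sum, [1, 1])
    print(fs.result())  # computed once, then served from the cache
```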

## Examples
The Python standard library provides the [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the [ProcessPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) and the
[ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) for parallel
execution of Python functions on a single computer. executorlib extends this functionality to distribute Python
functions over multiple computers within a high performance computing (HPC) cluster. This can be achieved either by
submitting each function as an individual job to the HPC job scheduler - [HPC Submission Mode]() - or by requesting a
compute allocation of multiple nodes and then distributing the Python functions within this allocation - [HPC Allocation Mode]().
Finally, to accelerate the development process executorlib also provides a [Local Mode]() to use the executorlib
functionality on a single workstation for testing. Starting with the [Local Mode](), enabled by setting the backend
parameter to local - `backend="local"`:
```python
from executorlib import Executor


with Executor(backend="local") as exe:
    future_lst = [exe.submit(sum, [i, i]) for i in range(1, 5)]
    print([f.result() for f in future_lst])
```
In the same way executorlib can also execute Python functions which use additional computing resources, like multiple
CPU cores, CPU threads or GPUs. For example, if the Python function internally uses the Message Passing Interface (MPI)
via the [mpi4py](https://mpi4py.readthedocs.io) Python library:
```python
from executorlib import Executor


def calc(i):
    from mpi4py import MPI

    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    return i, size, rank


with Executor(backend="local") as exe:
    fs = exe.submit(calc, 3, resource_dict={"cores": 2})
    print(fs.result())
```
This example can be executed using:
```
python example.py
```
Which returns:
```
>>> [(3, 2, 0), (3, 2, 1)]
```
The important part in this example is that [mpi4py](https://mpi4py.readthedocs.io) is only used in the `calc()`
function, not in the Python script itself. Consequently, it is not necessary to call the script with `mpiexec`; a call
with the regular Python interpreter is sufficient. This highlights how `executorlib` allows users to parallelize one
function at a time, without having to convert their whole workflow to [mpi4py](https://mpi4py.readthedocs.io).
The same code can also be executed directly inside a Jupyter notebook, which enables an interactive development process.

The interface of the standard [concurrent.futures.Executor](https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures)
is extended by adding the option `resource_dict={"cores": 2}` to assign two MPI ranks to each function call. To create
two workers, the maximum number of cores can be increased to `max_cores=4`. In this case each worker receives two cores,
resulting in a total of four CPU cores being utilized.

After submitting the function `calc()` with the corresponding parameter to the executor, `exe.submit(calc, 3)`, a Python
[`concurrent.futures.Future`](https://docs.python.org/3/library/concurrent.futures.html#future-objects) is returned.
Consequently, the `executorlib.Executor` can be used as a drop-in replacement for the
[`concurrent.futures.Executor`](https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures),
which allows the user to add parallelism to their workflow one function at a time.

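The following minimal sketch illustrates this drop-in property; it only assumes the `local` backend from the examples
above:
```python
import concurrent.futures

from executorlib import Executor


with Executor(backend="local") as exe:
    fs = exe.submit(sum, [2, 2])
    # the returned object is a standard concurrent.futures.Future
    print(isinstance(fs, concurrent.futures.Future))  # True
    concurrent.futures.wait([fs])
    print(fs.result())  # 4
```
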
The additional `resource_dict` parameter defines the computing resources allocated to the execution of the submitted
Python function. In addition to the number of compute cores `cores`, the resource dictionary can also define the
threads per core as `threads_per_core`, the GPUs per core as `gpus_per_core`, the working directory with `cwd`, the
option to use the OpenMPI oversubscribe feature with `openmpi_oversubscribe` and, finally, for the [Simple Linux Utility
for Resource Management (SLURM)](https://slurm.schedmd.com) queuing system, the option to provide additional command
line arguments with the `slurm_cmd_args` parameter - [resource dictionary]().
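
As a minimal sketch of how these keys combine, assuming the `calc()` function from the example above and a purely
hypothetical working directory path:
```python
from executorlib import Executor


with Executor(backend="local") as exe:
    fs = exe.submit(
        calc,  # the MPI-parallel function defined in the example above
        3,
        resource_dict={
            "cores": 2,  # two MPI ranks for this single function call
            "threads_per_core": 1,  # one thread per core
            "cwd": "/tmp/executorlib-demo",  # hypothetical working directory
        },
    )
    print(fs.result())
```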

This flexibility to assign computing resources on a per-function-call basis simplifies the up-scaling of Python
programs. Only the parts of a Python program which benefit from parallel execution are implemented as MPI-parallel
Python functions, while the rest of the program remains serial.

The same function can be submitted to the [SLURM](https://slurm.schedmd.com) queuing system by just changing the
`backend` parameter to `slurm_submission`. The rest of the example remains the same, which highlights how executorlib
accelerates the rapid prototyping and up-scaling of HPC Python programs.
```python
from executorlib import Executor


def calc(i):
    from mpi4py import MPI

    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    return i, size, rank


with Executor(backend="slurm_submission") as exe:
    fs = exe.submit(calc, 3, resource_dict={"cores": 2})
    print(fs.result())
```
In this case the [Python simple queuing system adapter (pysqa)](https://pysqa.readthedocs.io) is used to submit the
`calc()` function to the [SLURM](https://slurm.schedmd.com) job scheduler and to request an allocation with two CPU
cores for the execution of the function - [HPC Submission Mode](). In the background, the [sbatch](https://slurm.schedmd.com/sbatch.html)
command is used to request the allocation to execute the Python function.

Within a given [SLURM](https://slurm.schedmd.com) allocation, executorlib can also be used to assign a subset of the
available computing resources to execute a given Python function. In terms of [SLURM](https://slurm.schedmd.com)
commands, this functionality internally uses the [srun](https://slurm.schedmd.com/srun.html) command to request a
subset of the resources of a given queuing system allocation.
```python
from executorlib import Executor


def calc(i):
    from mpi4py import MPI

    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    return i, size, rank


with Executor(backend="slurm_allocation") as exe:
    fs = exe.submit(calc, 3, resource_dict={"cores": 2})
    print(fs.result())
```
In addition to the support for [SLURM](https://slurm.schedmd.com), executorlib also provides support for the
hierarchical [flux](http://flux-framework.org) job scheduler. The [flux](http://flux-framework.org) job scheduler is
developed at [Lawrence Livermore National Laboratory](https://computing.llnl.gov/projects/flux-building-framework-resource-management)
to address the needs of the upcoming generation of exascale computers. Still, even on traditional HPC clusters the
hierarchical approach of [flux](http://flux-framework.org) is beneficial for distributing hundreds of tasks within a
given allocation. Even when [SLURM](https://slurm.schedmd.com) is used as the primary job scheduler of your HPC, it is
recommended to use [SLURM with flux]() as the hierarchical job scheduler within the allocations.
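A minimal sketch of this mode, assuming a flux-enabled allocation and assuming `flux_allocation` matches the backend
names introduced in this pull request:
```python
from executorlib import Executor


def calc(i):
    from mpi4py import MPI

    size = MPI.COMM_WORLD.Get_size()
    rank = MPI.COMM_WORLD.Get_rank()
    return i, size, rank


# assumption: "flux_allocation" is the backend name for running inside a
# flux-managed allocation; this requires a working flux installation
with Executor(backend="flux_allocation") as exe:
    fs = exe.submit(calc, 3, resource_dict={"cores": 2})
    print(fs.result())
```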

## Disclaimer
While we try to develop a stable and reliable software library, the development remains an open-source project under
the BSD 3-Clause License without any warranties:
```
BSD 3-Clause License

Copyright (c) 2022, Jan Janssen
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

## Documentation
* [Installation](https://executorlib.readthedocs.io/en/latest/installation.html)
* [Compatible Job Schedulers](https://executorlib.readthedocs.io/en/latest/installation.html#compatible-job-schedulers)
* [executorlib with Flux Framework](https://executorlib.readthedocs.io/en/latest/installation.html#executorlib-with-flux-framework)
5 changes: 5 additions & 0 deletions binder/environment.yml
@@ -11,3 +11,8 @@ dependencies:
- flux-pmix =0.5.0
- versioneer =0.28
- h5py =3.12.1
- matplotlib =3.9.2
- networkx =3.4.2
- pygraphviz =1.14
- pysqa =0.2.2
- ipython =8.29.0
6 changes: 4 additions & 2 deletions docs/_toc.yml
@@ -2,7 +2,9 @@ format: jb-book
root: README
chapters:
- file: installation.md
- file: examples.ipynb
- file: development.md
- file: 1-local.ipynb
- file: 2-hpc-submission.ipynb
- file: 3-hpc-allocation.ipynb
- file: trouble_shooting.md
- file: 4-developer.ipynb
- file: api.rst