Releases: SC-SGS/CPPuddle
Release 0.3.1
Description
This is mostly a bugfix release:
- Fixed executor reference counting in work aggregation areas. This enables CPU/GPU load balancing again (mostly useful on consumer-grade hardware).
- Fixed aggregation mutex choice (should be `hpx::mutex`). Use `hpx::mutex` by default everywhere else now as well (though `std::mutex` remains a valid option here).
- Added an option to turn off executor pools whilst still providing the same interface (useful for performance comparisons).
What's Changed
- Update README.md by @G-071 in #23
- Fix combined CPU GPU execution by @G-071 in #24
- Add option to disable using the executor pool by @G-071 in #25
- Change mutex defaults by @G-071 in #26
Full Changelog: v0.3.0...v0.3.1
Release 0.3.0
Description
This release contains a refactored/overhauled buffer management core and adds proper MultiGPU support.
Feature list / Changelog:
- CPPuddle is now usable as a header-only library.
- Reworked the buffer manager by adding an HPX-aware mode and a variable number of internal buckets. This mode uses the OS thread ID as a hint to reduce locking and to obtain buffers from the correct NUMA node.
- Added a CMake variable to steer the number of internally used buckets (a tradeoff between speed and memory usage).
- Repaired and added MultiGPU functionality (also works for the work aggregation executors/allocators).
- Removed central reference counting for recycled Kokkos buffers (now counted per View).
- Added a proper finalize method that prevents further usage after being called.
- Added CMake toggles to enable/disable content recycling and buffer recycling as required (useful for benchmarking).
- Made the internal CPPuddle allocation/recycling counters compatible with HPX performance counters.
- Various bug fixes and a cleaned-up codebase.
Note: The MultiGPU addition required slight adjustments to the interface: various functions now take additional device_id parameters, and some gpu_id parameters from the defunct previous MultiGPU code have been removed. Other than that, the interface largely stayed the same.
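To illustrate the adjusted interface, here is a minimal sketch of allocating a recycled device buffer on a specific GPU. The type name `recycler::cuda_device_buffer`, its `device_id` constructor parameter, and the header names are taken from this release's layout but should be verified against the headers you actually build with:

```cpp
#include <cstddef>
// Header names follow this release's layout; verify against your checkout.
#include <buffer_manager.hpp>
#include <cuda_buffer_util.hpp>

void run_on_device(std::size_t n, std::size_t device_id) {
  // With the MultiGPU rework, allocations take a device_id so that
  // buffers are recycled per GPU rather than from one global pool.
  recycler::cuda_device_buffer<double> device_buf(n, device_id);

  // ... launch kernels using device_buf.device_side_buffer ...

}  // the allocation returns to device_id's recycling pool here
```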
Full Changelog: v0.2.0...v0.3.0
Release 0.2.1
Description
This release backports the interface changes from v0.3.0 to the older v0.2.0 release.
Feature list / Changelog:
- Backports the interface changes from release v0.3.0 to v0.2.0, effectively allowing applications such as Octo-Tiger to still use the old CPPuddle core (from 0.2.0) despite having been ported to the new CPPuddle interface (from 0.3.0).
- Notably, the interface was backported for v0.2.1 in a way that keeps this release compatible with the interface of previous CPPuddle releases (which was not feasible for 0.3.0 due to the removal of the old MultiGPU code).
- The release further fixes some small test issues.
Full Changelog: v0.2.0...v0.2.1
Release 0.2.0
Description
This release adds work aggregation/kernel fusion features, SYCL support, and A64FX support:
- Added explicit work aggregation executors and allocators. These allow multithreaded work aggregation/kernel fusion of GPU kernels when using HPX. They are intended to combine, on the fly, GPU kernels that do the same work on different HPX components (on the same HPX locality, though). See here for a more detailed description and benchmarks with a real-world HPX application on both an NVIDIA A100 and an AMD MI100 (using CUDA, HIP and Kokkos); a usage sketch follows this list.
- Added basic tests for the CPU work aggregation executors/allocators
- Added more detailed CPU/GPU STREAM tests for the work aggregation executors/allocators
- Added SYCL allocators (used for the benchmarks here)
- Fixed various CI bugs and fixed compilation on A64FX (see here for a usage example on A64FX machines)
- Note: Including the work aggregation executors/allocators requires C++17; other features still work with C++14.
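As a rough usage sketch: concurrent HPX tasks each request a slice of a shared aggregated executor, and CPPuddle fuses the work of all participating slices into single GPU kernel launches. The names below (`Aggregated_Executor`, `request_executor_slice`, the `aggregation_manager.hpp` header) follow this release's work aggregation code but should be treated as assumptions and checked against the linked description:

```cpp
// Hypothetical sketch of the work aggregation pattern; the exact type and
// method names should be checked against this release's aggregation headers.
#include <hpx/future.hpp>
#include <aggregation_manager.hpp>

template <typename Executor>
hpx::future<void> aggregated_task(Aggregated_Executor<Executor> &agg_exec) {
  // Each concurrent HPX task (one per component) asks for its slice of
  // the shared aggregated executor.
  return agg_exec.request_executor_slice().then([](auto &&slice_fut) {
    auto slice = slice_fut.get();
    // Buffers obtained through the slice's allocators are sub-ranges of
    // one shared, recycled allocation spanning all aggregated tasks.
    // Kernel launches posted through the slice are combined into a single
    // launch once every participating slice has contributed its part:
    // slice.post(my_fused_kernel, kernel_args...);
  });
}
```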
Pull requests
- Work aggregation experimental by @G-071 in #12
- Remove superfluous cuda header by @G-071 in #13
- Allow arbitrary de-allocation ordering in aggregation areas by @G-071 in #14
- Add sycl allocators by @G-071 in #15
- Remove flag requirement by @G-071 in #18
- Fix jenkins by @G-071 in #19
- Added view type by @G-071 in #20
- Fix compilation error on Ookami by @JiakunYan in #17
New Contributors
- @JiakunYan made their first contribution in #17
Full Changelog: v0.1.0...v0.2.0
Release 0.1.0
The code in this release has been in use for multiple months now and seems to work well, hence this initial release of the basic functionality before more experimental features are added!
The release contains the basic (multithreaded) recycling/reuse functionality for buffers and executors.
It provides allocators that enable the following (a usage sketch follows the list):
- Reuse of buffers allocated by std::allocator
- Reuse of aligned buffers
- Reuse of CUDA device memory buffers
- Reuse of CUDA pinned host memory buffers
- Reuse of HIP device memory buffers
- Reuse of HIP pinned host memory buffers
- Reuse of Kokkos Views (via a wrapper class)
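As a minimal usage sketch (assuming the `recycler::recycle_std` allocator name and `buffer_manager.hpp` header from this release), reusing a heap buffer only requires swapping the allocator template parameter:

```cpp
#include <cstddef>
#include <vector>
#include <buffer_manager.hpp>  // recycler allocators; header name assumed

void compute_step(std::size_t n) {
  // Drop-in replacement for std::allocator<double>: the underlying
  // buffer is returned to CPPuddle's pool on destruction and handed
  // out again for the next allocation of matching size and type.
  std::vector<double, recycler::recycle_std<double>> buf(n);

  // ... fill and use buf exactly like a normal std::vector ...

}  // buf's memory goes back to the recycling pool here, not to the OS
```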
It further provides executor pools for arbitrary executors with various scheduling policies (tested with HPX CUDA/HIP and Kokkos executors; a usage sketch follows the list):
- Round robin scheduling policy
- Priority scheduling policy
- MultiGPU with Round Robin scheduling policy
- MultiGPU with Priority scheduling policy
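A minimal sketch of the executor pool usage, assuming the `stream_pool` / `round_robin_pool` names from this release's `stream_manager.hpp`; treat the names and signatures below as assumptions to be checked against the headers:

```cpp
#include <cstddef>
#include <hpx/async_cuda/cuda_executor.hpp>
#include <stream_manager.hpp>  // pool types; names assumed from this release

using executor_t = hpx::cuda::experimental::cuda_executor;
using pool_t = round_robin_pool<executor_t>;

void pool_example(std::size_t number_of_executors) {
  // Create a pool of executors on device 0 (arguments after the pool
  // size are forwarded to each executor's constructor).
  stream_pool::init<executor_t, pool_t>(
      number_of_executors, /*device=*/0, /*event_mode=*/true);

  // Borrow the next executor from the pool in round-robin order.
  // (Accessor name assumed; check the header of the version in use.)
  auto interface = stream_pool::get_interface<executor_t, pool_t>();

  // ... launch work via the usual HPX executor API on the interface ...
}
```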
The release also includes CI functionality on GitHub Actions and Jenkins (for GPU and concurrency tests).