diff --git a/README.md b/README.md index 980c4cc0..9e3adb32 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,5 @@ ### CPPuddle -WARNING: This repository is a work in progress and should not be relied on for production use! - [![ctest](https://github.com/SC-SGS/CPPuddle/actions/workflows/cmake.yml/badge.svg)](https://github.com/SC-SGS/CPPuddle/actions/workflows/cmake.yml) [![Build Status](https://simsgs.informatik.uni-stuttgart.de/jenkins/buildStatus/icon?job=CPPuddle%2Fmaster&config=allbuilds)](https://simsgs.informatik.uni-stuttgart.de/jenkins/view/Octo-Tiger%20and%20Dependencies/job/CPPuddle/job/master/) @@ -11,13 +9,13 @@ WARNING: This repository is a work in progress and should not be relied on for p This repository was initially created to explore how to best use HPX and Kokkos together! For fine-grained GPU tasks, we needed a way to avoid excessive allocations of one-usage GPU buffers (as allocations block the device for all streams) and creation/deletion of GPU executors (as those are usually tied to a stream which is expensive to create as well). -We currently test it in the experimental build of [Octo-Tiger](https://github.com/STEllAR-GROUP/octotiger), together with [HPX-Kokkos](https://github.com/STEllAR-GROUP/hpx-kokkos). +We currently test/use CPPuddle in [Octo-Tiger](https://github.com/STEllAR-GROUP/octotiger), together with [HPX-Kokkos](https://github.com/STEllAR-GROUP/hpx-kokkos). In this use-case, allocating GPU buffers for all sub-grids in advance would have wasted a lot of memory. On the other hand, unified memory would have caused unnecessary GPU to CPU page migrations (as the old input data gets overwritten anyway). Allocating buffers on-the-fly would have blocked the device. Hence, we currently test this buffer management solution! #### Tools provided by this repository -- Allocators that reuse previousely allocated buffers if available (works with normal heap memory, pinned memory, aligned memory, CUDA device memory, and Kokkos Views). Note that separate buffers do not coexist on a single chunk of continuous memory, but use different allocations. -- Executor pools and various scheduling policies (round robin, priority queue, multi-gpu), which rely on reference counting to gauge the current load of a executor instead of querying the device itself. +- Allocators that reuse previousely allocated buffers if available (works with normal heap memory, pinned memory, aligned memory, CUDA/HIP device memory, and Kokkos Views). Note that separate buffers do not coexist on a single chunk of continuous memory, but use different allocations. +- Executor pools and various scheduling policies (round robin, priority queue, multi-gpu), which rely on reference counting to gauge the current load of a executor instead of querying the device itself. Tested with CUDA, HIP and Kokkos executors provided by HPX / HPX-Kokkos. #### Requirements