Add stream wrapper types (#608)
* Add stream wrapper type

* Update changelog

* Fix style

* Fix stream namespace in benchmark

* Apply suggestions from code review

Co-authored-by: Mark Harris <[email protected]>

* Rename stream_t to stream_view

* Add explicit conversion from stream_view to uintptr_t

* Add get_default_stream

* Rename stream_view to cuda_stream_view

* Revise documentation

* Add test for set_default_stream

* Rename stream.hpp to cuda_stream_view.hpp

* Fix merge

* Fixes for stream view on recent classes

* Changelog for #608

* Use cudaStreamLegacy and friends directly (not wrapped)

* Move cuda_stream_view.hpp out of detail folder

* Make stream view as constexpr and noexcept as possible

* Fix device_buffer to use cuda_stream_view

* Fix device_scalar constructor to use cuda_stream_view

* Add cuda_stream owning wrapper

* Update tests to use cuda_stream

* Remove extra include

* Remove explicit references to stream "0"

* Use unique_ptr to allow cuda_stream rule of zero

* default and constexpr all the things in cuda_stream_view

* Remove get/set_default_resource until needed.

* Explicit cudaStreamDefault

* No need to delete copy ctor/operator=

* Simplify defaults

* Remove changelog entry from old PR

* Add cuda_stream_view::synchronize()

* Add rmm_log.txt to .gitignore

* cuda_stream_view Cython

* Fix isort

* Lots of cleanup of stream classes and tests (C++)

* Rename .cu -> .cpp

* Cython CudaStreamView constructor and doc cleanup

* Fix isort

* Fix cast

* Implement operator!= in terms of operator==

Co-authored-by: Jake Hemstad <[email protected]>

* Improve CudaStreamView docs

Co-authored-by: Ashwin Srinath <[email protected]>

* Clean up PTDS

* Explicitly delete copy ctor / assignment

* More cython streamlining

* Python docs and style

* Disallow implicit conversion to cudaStream_t

* clean up casting in logs

* Add new and missing headers to meta.yml

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update docs to use `cuda_stream_view`

* Delete the stream...

* Replace missed cudaStream_t parameters in device_uvector

* More missed cudaStream_t

* Replace cudaStream_t in multithreaded tests.

* No need to call get()

* Remove const& from stream view params

* Combine is_default and is_legacy_default.

* except *

* style

* Remove uses of `cudaStreamDefault` as a stream handle

* Use cuda_stream_view in replay benchmark and simplify logging_resource_adaptor.

Co-authored-by: Trevor Smith <[email protected]>
Co-authored-by: Jake Hemstad <[email protected]>
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Keith Kraus <[email protected]>
5 people authored Oct 27, 2020
1 parent 210c863 commit 9b32765
Showing 42 changed files with 841 additions and 384 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -144,3 +144,6 @@ ENV/

# mypy
.mypy_cache/

# RMM log files
rmm_log.txt
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,8 @@

## New Features

- PR #608 Add stream wrapper type

## Improvements

- PR #599 Make the arena memory resource work better with the producer/consumer mode
30 changes: 21 additions & 9 deletions README.md
@@ -185,10 +185,10 @@ freeing device memory.

It has two key functions:

-1. `void* device_memory_resource::allocate(std::size_t bytes, cudaStream_t s)`
+1. `void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)`
- Returns a pointer to an allocation of at least `bytes` bytes.

-2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cudaStream_t s)`
+2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)`
- Reclaims a previous allocation of size `bytes` pointed to by `p`.
- `p` *must* have been returned by a previous call to `allocate(bytes)`, otherwise behavior is
undefined
@@ -198,9 +198,21 @@ It is up to a derived class to provide implementations of these functions. See

Unlike `std::pmr::memory_resource`, `rmm::mr::device_memory_resource` does not allow specifying an
alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,
-`device_memory_resource` adds an additional `cudaStream_t` argument to allow specifying the stream
+`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream
on which to perform the (de)allocation.
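
For illustration, a minimal sketch of driving these two functions through the current device resource (the `get_current_device_resource()` call and the `per_device_resource.hpp` header path are assumptions for this example, not part of this diff):

```c++
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/per_device_resource.hpp>  // assumed header for get_current_device_resource

void allocate_example(rmm::cuda_stream_view stream)
{
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource();
  void* p = mr->allocate(100, stream);  // at least 100 bytes, aligned to at least 256B
  // ... launch kernels that use `p` on `stream` ...
  mr->deallocate(p, 100, stream);  // `bytes` must match the original allocation
}
```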

## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (the constructor creates the CUDA stream and the destructor destroys it). An
`rmm::cuda_stream` can never represent the CUDA default stream or per-thread default stream; it
only ever represents a single non-default stream. `rmm::cuda_stream` cannot be copied but can be moved.
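
A minimal usage sketch tying the two types together (the `view()` accessor is an assumption for illustration; `synchronize()` is the member added in this PR):

```c++
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

void stream_example()
{
  rmm::cuda_stream s;                  // RAII: creates a new non-default CUDA stream
  rmm::cuda_stream_view v = s.view();  // non-owning view of the same stream
  rmm::device_buffer b{256, v};        // stream-ordered allocation on `s`
  v.synchronize();                     // wait for work queued on the stream
}                                      // destructor of `s` destroys the CUDA stream
```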

### Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
@@ -335,11 +347,11 @@ An untyped, uninitialized RAII class for stream ordered device memory allocation.
#### Example

```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_buffer b{100,s}; // Allocates at least 100 bytes on stream `s` using the *default* resource
void* p = b.data(); // Raw, untyped pointer to underlying device memory

-kernel<<<..., s>>>(b.data()); // `b` is only safe to use on `s`
+kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`

rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
rmm::device_buffer b2{100, s, mr}; // Allocates at least 100 bytes on stream `s` using the explicitly provided resource
```
@@ -353,9 +365,9 @@ contained elements. This optimization restricts the types `T` to trivially copyable
#### Example
```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_uvector<int32_t> v(100, s); /// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the default resource
-thrust::uninitialized_fill(thrust::cuda::par.on(s), v.begin(), v.end(), int32_t{0}); // Initializes the elements to 0
+thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0}); // Initializes the elements to 0
rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
rmm::device_vector<int32_t> v2{100, s, mr}; // Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the explicitly provided resource
```
@@ -368,11 +380,11 @@ modifying the value in device memory from the host, or retrieving the value from

#### Example
```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_scalar<int32_t> a{s}; // Allocates uninitialized storage for a single `int32_t` in device memory
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`

-kernel<<<...,s>>>(a.data()); // Pass raw pointer to underlying element in device memory
+kernel<<<...,s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory

int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`
```
2 changes: 1 addition & 1 deletion benchmarks/device_uvector/device_uvector_bench.cu
@@ -30,7 +30,7 @@ static void BM_UvectorSizeConstruction(benchmark::State& state)
rmm::mr::set_current_device_resource(&mr);

for (auto _ : state) {
-rmm::device_uvector<int32_t> vec(state.range(0), cudaStream_t{0});
+rmm::device_uvector<int32_t> vec(state.range(0), rmm::cuda_stream_view{});
cudaDeviceSynchronize();
}
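
Note: the default-constructed `rmm::cuda_stream_view{}` above stands in for the CUDA default stream, replacing the raw `cudaStream_t{0}` handle used previously.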

6 changes: 3 additions & 3 deletions benchmarks/random_allocations/random_allocations.cpp
@@ -62,7 +62,7 @@ void random_allocation_free(rmm::mr::device_memory_resource& mr,
SizeDistribution size_distribution,
size_t num_allocations,
size_t max_usage, // in MiB
-cudaStream_t stream = 0)
+rmm::cuda_stream_view stream = {})
{
std::default_random_engine generator;

@@ -139,7 +139,7 @@ void uniform_random_allocations(rmm::mr::device_memory_resource& mr,
size_t num_allocations,
size_t max_allocation_size, // in MiB
size_t max_usage,
-cudaStream_t stream = 0)
+rmm::cuda_stream_view stream = {})
{
std::uniform_int_distribution<std::size_t> size_distribution(1, max_allocation_size * size_mb);
random_allocation_free(mr, size_distribution, num_allocations, max_usage, stream);
@@ -151,7 +151,7 @@ void uniform_random_allocations(rmm::mr::device_memory_resource& mr,
size_t mean_allocation_size = 500, // in MiB
size_t stddev_allocation_size = 500, // in MiB
size_t max_usage = 8 << 20,
-cudaStream_t stream) {
+cuda_stream_view stream) {
std::normal_distribution<std::size_t> size_distribution(, max_allocation_size * size_mb);
}*/

3 changes: 1 addition & 2 deletions benchmarks/replay/replay.cpp
@@ -242,8 +242,7 @@ std::vector<std::vector<rmm::detail::event>> parse_per_thread_events(std::string
RMM_EXPECTS(std::all_of(all_events.begin(),
all_events.end(),
[](auto const& e) {
-return (e.stream == cudaStreamDefault) or
-(e.stream == reinterpret_cast<uintptr_t>(cudaStreamPerThread));
+return e.stream.is_default() or e.stream.is_per_thread_default();
}),
"Non-default streams not currently supported.");

48 changes: 32 additions & 16 deletions benchmarks/utilities/log_parser.hpp
@@ -21,6 +21,7 @@
#include <rmm/mr/device/device_memory_resource.hpp>

#include "rapidcsv.h"
#include "rmm/cuda_stream_view.hpp"

#include <cstdint>
#include <iomanip>
@@ -50,25 +51,26 @@ struct event {

event(action a, std::size_t s, uintptr_t p) : act{a}, size{s}, pointer{p} {}

-event(std::size_t tid, action a, std::size_t sz, uintptr_t p, uintptr_t s, std::size_t i)
+event(
+  std::size_t tid, action a, std::size_t sz, uintptr_t p, rmm::cuda_stream_view s, std::size_t i)
: thread_id{tid}, act{a}, size{sz}, pointer{p}, stream{s}, index{i}
{
}

-event(std::size_t tid, action a, std::size_t sz, void* p, uintptr_t s, std::size_t i)
+event(std::size_t tid, action a, std::size_t sz, void* p, rmm::cuda_stream_view s, std::size_t i)
: event{tid, a, sz, reinterpret_cast<uintptr_t>(p), s, i}
{
}

friend std::ostream& operator<<(std::ostream& os, event const& e);

-action act{}; ///< Indicates if the event is an allocation or a free
-std::size_t size{}; ///< The size of the memory allocated or freed
-uintptr_t pointer{}; ///< The pointer returned from an allocation, or the
-                     ///< pointer freed
-std::size_t thread_id; ///< ID of the thread that initiated the event
-uintptr_t stream; ///< Numeric representation of the CUDA stream on which the event occurred
-std::size_t index; ///< Original ordering index of the event
+action act{}; ///< Indicates if the event is an allocation or a free
+std::size_t size{}; ///< The size of the memory allocated or freed
+uintptr_t pointer{}; ///< The pointer returned from an allocation, or the
+                     ///< pointer freed
+std::size_t thread_id; ///< ID of the thread that initiated the event
+rmm::cuda_stream_view stream; ///< The CUDA stream on which the event occurred
+std::size_t index; ///< Original ordering index of the event
};

inline std::ostream& operator<<(std::ostream& os, event const& e)
@@ -127,11 +129,25 @@ inline std::vector<event> parse_csv(std::string const& filename)
{
rapidcsv::Document csv(filename, rapidcsv::LabelParams(0, -1));

-std::vector<std::size_t> tids = csv.GetColumn<std::size_t>("Thread");
-std::vector<std::string> actions = csv.GetColumn<std::string>("Action");
-std::vector<std::string> pointers = csv.GetColumn<std::string>("Pointer");
-std::vector<std::size_t> sizes = csv.GetColumn<std::size_t>("Size");
-std::vector<uintptr_t> streams = csv.GetColumn<uintptr_t>("Stream");
+std::vector<std::size_t> tids = csv.GetColumn<std::size_t>("Thread");
+std::vector<std::string> actions = csv.GetColumn<std::string>("Action");
+
+auto parse_pointer = [](std::string const& s, uintptr_t& ptr) {
+  ptr = std::stoll(s, nullptr, 16);
+};
+
+std::vector<uintptr_t> pointers = csv.GetColumn<uintptr_t>("Pointer", parse_pointer);
+std::vector<std::size_t> sizes = csv.GetColumn<std::size_t>("Size");
+
+auto parse_stream = [](std::string const& s, rmm::cuda_stream_view& stream) {
+  cudaStream_t cs;
+  uintptr_t ls = std::stoll(s);
+  std::memcpy(&cs, &ls, sizeof(cudaStream_t));
+  stream = rmm::cuda_stream_view{cs};
+};
+
+std::vector<rmm::cuda_stream_view> streams =
+  csv.GetColumn<rmm::cuda_stream_view>("Stream", parse_stream);

auto const size_list = {tids.size(), actions.size(), pointers.size(), streams.size()};

@@ -146,10 +162,10 @@ inline std::vector<event> parse_csv(std::string const& filename)
auto const& a = actions[i];
RMM_EXPECTS((a == "allocate") or (a == "free"), "Invalid action string.");
auto act = (a == "allocate") ? action::ALLOCATE : action::FREE;
-events[i] = event{tids[i], act, sizes[i], hex_string_to_int(pointers[i]), streams[i], i};
+events[i] = event{tids[i], act, sizes[i], pointers[i], streams[i], i};
}
return events;
}
-} // namespace detail

+} // namespace detail
} // namespace rmm
6 changes: 3 additions & 3 deletions benchmarks/utilities/simulated_memory_resource.hpp
@@ -74,7 +74,7 @@ class simulated_memory_resource final : public device_memory_resource {
* @param bytes The size, in bytes, of the allocation
* @return void* Pointer to the newly allocated memory
*/
-void* do_allocate(std::size_t bytes, cudaStream_t) override
+void* do_allocate(std::size_t bytes, cuda_stream_view) override
{
RMM_EXPECTS(begin_ + bytes <= end_, rmm::bad_alloc, "Simulated memory size exceeded");
auto p = static_cast<void*>(begin_);
@@ -91,15 +91,15 @@
*
* @param p Pointer to be deallocated
*/
-void do_deallocate(void* p, std::size_t, cudaStream_t) override {}
+void do_deallocate(void* p, std::size_t, cuda_stream_view) override {}

/**
* @brief Get free and available memory for memory resource.
*
* @param stream to execute on.
* @return std::pair containing free_size and total_size of memory.
*/
-std::pair<std::size_t, std::size_t> do_get_mem_info(cudaStream_t stream) const override
+std::pair<std::size_t, std::size_t> do_get_mem_info(cuda_stream_view stream) const override
{
return std::make_pair(0, 0);
}
5 changes: 5 additions & 0 deletions conda/recipes/librmm/meta.yaml
@@ -34,19 +34,24 @@ test:
commands:
- test -f $PREFIX/include/rmm/thrust_rmm_allocator.h
- test -f $PREFIX/include/rmm/logger.hpp
- test -f $PREFIX/include/rmm/cuda_stream.hpp
- test -f $PREFIX/include/rmm/cuda_stream_view.hpp
- test -f $PREFIX/include/rmm/device_uvector.hpp
- test -f $PREFIX/include/rmm/device_scalar.hpp
- test -f $PREFIX/include/rmm/device_buffer.hpp
- test -f $PREFIX/include/rmm/detail/aligned.hpp
- test -f $PREFIX/include/rmm/detail/error.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/arena.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/coalescing_free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/fixed_size_free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/arena_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/binning_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/cuda_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/device_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/fixed_size_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/limiting_resource_adaptor.hpp
- test -f $PREFIX/include/rmm/mr/device/logging_resource_adaptor.hpp
- test -f $PREFIX/include/rmm/mr/device/managed_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/owning_wrapper.hpp