Add stream wrapper types (#608)
* Add stream wrapper type

* Update changelog

* Fix style

* Fix stream namespace in benchmark

* Apply suggestions from code review

Co-authored-by: Mark Harris <[email protected]>

* Rename stream_t to stream_view

* Add explicit conversion from stream_view to uintptr_t

* Add get_default_stream

* Rename stream_view to cuda_stream_view

* Revise documentation

* Add test for set_default_stream

* Rename stream.hpp to cuda_stream_view.hpp

* Fix merge

* Fixes for stream view on recent classes

* Changelog for #608

* Use cudaStreamLegacy and friends directly (not wrapped)

* Move cuda_stream_view.hpp out of detail folder

* Make stream view as constexpr and noexcept as possible

* Fix device_buffer to use cuda_stream_view

* Fix device_scalar constructor to use cuda_stream_view

* Add cuda_stream owning wrapper

* Update tests to use cuda_stream

* Remove extra include

* Remove explicit references to stream "0"

* Use unique_ptr to allow cuda_stream rule of zero

* default and constexpr all the things in cuda_stream_view

* Remove get/set_default_resource until needed.

* Explicit cudaStreamDefault

* No need to delete copy ctor/operator=

* Simplify defaults

* Remove changelog entry from old PR

* Add cuda_stream_view::synchronize()

* Add rmm_log.txt to .gitignore

* cuda_stream_view Cython

* Fix isort

* Lots of cleanup of stream classes and tests (C++)

* Rename .cu -> .cpp

* Cython CudaStreamView constructor and doc cleanup

* Fix isort

* Fix cast

* Implement operator!= in terms of operator==

Co-authored-by: Jake Hemstad <[email protected]>

* Improve CudaStreamView docs

Co-authored-by: Ashwin Srinath <[email protected]>

* Clean up PTDS

* Explicitly delete copy ctor / assignment

* More cython streamlining

* Python docs and style

* Disallow implicit conversion to cudaStream_t

* clean up casting in logs

* Add new and missing headers to meta.yml

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update python/rmm/_lib/device_buffer.pyx

Co-authored-by: Keith Kraus <[email protected]>

* Update docs to use `cuda_stream_view`

* Delete the stream...

* Replace missed cudaStream_t parameters in device_uvector

* More missed cudaStream_t

* Replace cudaStream_t in multithreaded tests.

* No need to call get()

* Remove const& from stream view params

* Combine is_default and is_legacy_default.

* except *

* style

* Remove uses of `cudaStreamDefault` as a stream handle

* Use cuda_stream_view in replay benchmark and simplify logging_resource_adaptor.

Co-authored-by: Trevor Smith <[email protected]>
Co-authored-by: Jake Hemstad <[email protected]>
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Keith Kraus <[email protected]>
5 people authored Oct 27, 2020
1 parent 210c863 commit 9b32765
Showing 42 changed files with 841 additions and 384 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -144,3 +144,6 @@ ENV/

# mypy
.mypy_cache/

# RMM log files
rmm_log.txt
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,8 @@

## New Features

- PR #608 Add stream wrapper type

## Improvements

- PR #599 Make the arena memory resource work better with the producer/consumer mode
30 changes: 21 additions & 9 deletions README.md
@@ -185,10 +185,10 @@ freeing device memory.

It has two key functions:

-1. `void* device_memory_resource::allocate(std::size_t bytes, cudaStream_t s)`
+1. `void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)`
- Returns a pointer to an allocation of at least `bytes` bytes.

-2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cudaStream_t s)`
+2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)`
- Reclaims a previous allocation of size `bytes` pointed to by `p`.
- `p` *must* have been returned by a previous call to `allocate(bytes)`, otherwise behavior is
undefined
@@ -198,9 +198,21 @@ It is up to a derived class to provide implementations of these functions. See

Unlike `std::pmr::memory_resource`, `rmm::mr::device_memory_resource` does not allow specifying an
alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,
-`device_memory_resource` adds an additional `cudaStream_t` argument to allow specifying the stream
+`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream
on which to perform the (de)allocation.
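
For illustration, a minimal sketch of driving these two functions through the current device resource (the `get_current_device_resource()` call and the `per_device_resource.hpp` header path are assumptions for this example, not part of this diff):

```c++
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/per_device_resource.hpp>  // assumed header for get_current_device_resource

void allocate_example(rmm::cuda_stream_view stream)
{
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource();
  void* p = mr->allocate(100, stream);  // at least 100 bytes, aligned to at least 256B
  // ... launch kernels that use `p` on `stream` ...
  mr->deallocate(p, 100, stream);  // `bytes` must match the original allocation
}
```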

## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (the constructor creates the CUDA stream and the destructor destroys it). An
`rmm::cuda_stream` can never represent the CUDA default stream or per-thread default stream; it
only ever represents a single non-default stream. `rmm::cuda_stream` cannot be copied but can be moved.
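
A minimal usage sketch tying the two types together (the `view()` accessor is an assumption for illustration; `synchronize()` is the member added in this PR):

```c++
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

void stream_example()
{
  rmm::cuda_stream s;                  // RAII: creates a new non-default CUDA stream
  rmm::cuda_stream_view v = s.view();  // non-owning view of the same stream
  rmm::device_buffer b{256, v};        // stream-ordered allocation on `s`
  v.synchronize();                     // wait for work queued on the stream
}                                      // destructor of `s` destroys the CUDA stream
```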

### Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
@@ -335,11 +347,11 @@ An untyped, uninitialized RAII class for stream ordered device memory allocation.
#### Example

```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_buffer b{100,s}; // Allocates at least 100 bytes on stream `s` using the *default* resource
void* p = b.data(); // Raw, untyped pointer to underlying device memory

-kernel<<<..., s>>>(b.data()); // `b` is only safe to use on `s`
+kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`

rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
rmm::device_buffer b2{100, s, mr}; // Allocates at least 100 bytes on stream `s` using the explicitly provided resource
```
@@ -353,9 +365,9 @@ contained elements. This optimization restricts the types `T` to trivially copyable
#### Example
```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_uvector<int32_t> v(100, s); /// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the default resource
-thrust::uninitialized_fill(thrust::cuda::par.on(s), v.begin(), v.end(), int32_t{0}); // Initializes the elements to 0
+thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0}); // Initializes the elements to 0
rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
rmm::device_vector<int32_t> v2{100, s, mr}; // Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the explicitly provided resource
```
@@ -368,11 +380,11 @@ modifying the value in device memory from the host, or retrieving the value from

#### Example
```c++
-cudaStream_t s;
+cuda_stream_view s{...};
rmm::device_scalar<int32_t> a{s}; // Allocates uninitialized storage for a single `int32_t` in device memory
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`

-kernel<<<...,s>>>(a.data()); // Pass raw pointer to underlying element in device memory
+kernel<<<...,s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory

int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`
```
2 changes: 1 addition & 1 deletion benchmarks/device_uvector/device_uvector_bench.cu
@@ -30,7 +30,7 @@ static void BM_UvectorSizeConstruction(benchmark::State& state)
rmm::mr::set_current_device_resource(&mr);

for (auto _ : state) {
-rmm::device_uvector<int32_t> vec(state.range(0), cudaStream_t{0});
+rmm::device_uvector<int32_t> vec(state.range(0), rmm::cuda_stream_view{});
cudaDeviceSynchronize();
}
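
Note: the default-constructed `rmm::cuda_stream_view{}` above stands in for the CUDA default stream, replacing the raw `cudaStream_t{0}` handle used previously.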

6 changes: 3 additions & 3 deletions benchmarks/random_allocations/random_allocations.cpp
@@ -62,7 +62,7 @@ void random_allocation_free(rmm::mr::device_memory_resource& mr,
SizeDistribution size_distribution,
size_t num_allocations,
size_t max_usage, // in MiB
-cudaStream_t stream = 0)
+rmm::cuda_stream_view stream = {})
{
std::default_random_engine generator;

@@ -139,7 +139,7 @@ void uniform_random_allocations(rmm::mr::device_memory_resource& mr,
size_t num_allocations,
size_t max_allocation_size, // in MiB
size_t max_usage,
-cudaStream_t stream = 0)
+rmm::cuda_stream_view stream = {})
{
std::uniform_int_distribution<std::size_t> size_distribution(1, max_allocation_size * size_mb);
random_allocation_free(mr, size_distribution, num_allocations, max_usage, stream);
@@ -151,7 +151,7 @@ void uniform_random_allocations(rmm::mr::device_memory_resource& mr,
size_t mean_allocation_size = 500, // in MiB
size_t stddev_allocation_size = 500, // in MiB
size_t max_usage = 8 << 20,
-cudaStream_t stream) {
+cuda_stream_view stream) {
std::normal_distribution<std::size_t> size_distribution(, max_allocation_size * size_mb);
}*/

3 changes: 1 addition & 2 deletions benchmarks/replay/replay.cpp
@@ -242,8 +242,7 @@ std::vector<std::vector<rmm::detail::event>> parse_per_thread_events(std::string
RMM_EXPECTS(std::all_of(all_events.begin(),
all_events.end(),
[](auto const& e) {
-return (e.stream == cudaStreamDefault) or
-(e.stream == reinterpret_cast<uintptr_t>(cudaStreamPerThread));
+return e.stream.is_default() or e.stream.is_per_thread_default();
}),
"Non-default streams not currently supported.");

48 changes: 32 additions & 16 deletions benchmarks/utilities/log_parser.hpp
@@ -21,6 +21,7 @@
#include <rmm/mr/device/device_memory_resource.hpp>

#include "rapidcsv.h"
#include "rmm/cuda_stream_view.hpp"

#include <cstdint>
#include <iomanip>
@@ -50,25 +51,26 @@ struct event {

event(action a, std::size_t s, uintptr_t p) : act{a}, size{s}, pointer{p} {}

-event(std::size_t tid, action a, std::size_t sz, uintptr_t p, uintptr_t s, std::size_t i)
+event(
+  std::size_t tid, action a, std::size_t sz, uintptr_t p, rmm::cuda_stream_view s, std::size_t i)
: thread_id{tid}, act{a}, size{sz}, pointer{p}, stream{s}, index{i}
{
}

-event(std::size_t tid, action a, std::size_t sz, void* p, uintptr_t s, std::size_t i)
+event(std::size_t tid, action a, std::size_t sz, void* p, rmm::cuda_stream_view s, std::size_t i)
: event{tid, a, sz, reinterpret_cast<uintptr_t>(p), s, i}
{
}

friend std::ostream& operator<<(std::ostream& os, event const& e);

-action act{}; ///< Indicates if the event is an allocation or a free
-std::size_t size{}; ///< The size of the memory allocated or freed
-uintptr_t pointer{}; ///< The pointer returned from an allocation, or the
-                     ///< pointer freed
-std::size_t thread_id; ///< ID of the thread that initiated the event
-uintptr_t stream; ///< Numeric representation of the CUDA stream on which the event occurred
-std::size_t index; ///< Original ordering index of the event
+action act{}; ///< Indicates if the event is an allocation or a free
+std::size_t size{}; ///< The size of the memory allocated or freed
+uintptr_t pointer{}; ///< The pointer returned from an allocation, or the
+                     ///< pointer freed
+std::size_t thread_id; ///< ID of the thread that initiated the event
+rmm::cuda_stream_view stream; ///< The CUDA stream on which the event occurred
+std::size_t index; ///< Original ordering index of the event
};

inline std::ostream& operator<<(std::ostream& os, event const& e)
@@ -127,11 +129,25 @@ inline std::vector<event> parse_csv(std::string const& filename)
{
rapidcsv::Document csv(filename, rapidcsv::LabelParams(0, -1));

-std::vector<std::size_t> tids = csv.GetColumn<std::size_t>("Thread");
-std::vector<std::string> actions = csv.GetColumn<std::string>("Action");
-std::vector<std::string> pointers = csv.GetColumn<std::string>("Pointer");
-std::vector<std::size_t> sizes = csv.GetColumn<std::size_t>("Size");
-std::vector<uintptr_t> streams = csv.GetColumn<uintptr_t>("Stream");
+std::vector<std::size_t> tids = csv.GetColumn<std::size_t>("Thread");
+std::vector<std::string> actions = csv.GetColumn<std::string>("Action");
+
+auto parse_pointer = [](std::string const& s, uintptr_t& ptr) {
+  ptr = std::stoll(s, nullptr, 16);
+};
+
+std::vector<uintptr_t> pointers = csv.GetColumn<uintptr_t>("Pointer", parse_pointer);
+std::vector<std::size_t> sizes = csv.GetColumn<std::size_t>("Size");
+
+auto parse_stream = [](std::string const& s, rmm::cuda_stream_view& stream) {
+  cudaStream_t cs;
+  uintptr_t ls = std::stoll(s);
+  std::memcpy(&cs, &ls, sizeof(cudaStream_t));
+  stream = rmm::cuda_stream_view{cs};
+};
+
+std::vector<rmm::cuda_stream_view> streams =
+  csv.GetColumn<rmm::cuda_stream_view>("Stream", parse_stream);

auto const size_list = {tids.size(), actions.size(), pointers.size(), streams.size()};

@@ -146,10 +162,10 @@ inline std::vector<event> parse_csv(std::string const& filename)
auto const& a = actions[i];
RMM_EXPECTS((a == "allocate") or (a == "free"), "Invalid action string.");
auto act = (a == "allocate") ? action::ALLOCATE : action::FREE;
-events[i] = event{tids[i], act, sizes[i], hex_string_to_int(pointers[i]), streams[i], i};
+events[i] = event{tids[i], act, sizes[i], pointers[i], streams[i], i};
}
return events;
}
-} // namespace detail

+} // namespace detail
} // namespace rmm
6 changes: 3 additions & 3 deletions benchmarks/utilities/simulated_memory_resource.hpp
@@ -74,7 +74,7 @@ class simulated_memory_resource final : public device_memory_resource {
* @param bytes The size, in bytes, of the allocation
* @return void* Pointer to the newly allocated memory
*/
-void* do_allocate(std::size_t bytes, cudaStream_t) override
+void* do_allocate(std::size_t bytes, cuda_stream_view) override
{
RMM_EXPECTS(begin_ + bytes <= end_, rmm::bad_alloc, "Simulated memory size exceeded");
auto p = static_cast<void*>(begin_);
@@ -91,15 +91,15 @@
*
* @param p Pointer to be deallocated
*/
-void do_deallocate(void* p, std::size_t, cudaStream_t) override {}
+void do_deallocate(void* p, std::size_t, cuda_stream_view) override {}

/**
* @brief Get free and available memory for memory resource.
*
* @param stream to execute on.
* @return std::pair containing free_size and total_size of memory.
*/
-std::pair<std::size_t, std::size_t> do_get_mem_info(cudaStream_t stream) const override
+std::pair<std::size_t, std::size_t> do_get_mem_info(cuda_stream_view stream) const override
{
return std::make_pair(0, 0);
}
5 changes: 5 additions & 0 deletions conda/recipes/librmm/meta.yaml
@@ -34,19 +34,24 @@ test:
commands:
- test -f $PREFIX/include/rmm/thrust_rmm_allocator.h
- test -f $PREFIX/include/rmm/logger.hpp
- test -f $PREFIX/include/rmm/cuda_stream.hpp
- test -f $PREFIX/include/rmm/cuda_stream_view.hpp
- test -f $PREFIX/include/rmm/device_uvector.hpp
- test -f $PREFIX/include/rmm/device_scalar.hpp
- test -f $PREFIX/include/rmm/device_buffer.hpp
- test -f $PREFIX/include/rmm/detail/aligned.hpp
- test -f $PREFIX/include/rmm/detail/error.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/arena.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/coalescing_free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/fixed_size_free_list.hpp
- test -f $PREFIX/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/arena_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/binning_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/cuda_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/device_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/fixed_size_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/limiting_resource_adaptor.hpp
- test -f $PREFIX/include/rmm/mr/device/logging_resource_adaptor.hpp
- test -f $PREFIX/include/rmm/mr/device/managed_memory_resource.hpp
- test -f $PREFIX/include/rmm/mr/device/owning_wrapper.hpp