Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise RAJA plugin support #1742

Open
adayton1 opened this issue Sep 24, 2024 · 15 comments
Open

Revise RAJA plugin support #1742

adayton1 opened this issue Sep 24, 2024 · 15 comments
Assignees
Labels
API/usability Enhancement reviewed Mark with this label when issue has been discussed by team

Comments

@adayton1
Copy link
Member

Is your feature request related to a problem? Please describe.

RAJA plugins are used by CHAI to make sure the data backing ManagedArray is in the correct memory space and that it is up to date. However, the approach used now is not stream aware. This leads to suboptimal performance on GPU platforms. Where there is a dual memory space (CUDA), memory copies to the host are done on stream 0, which forces the whole device to synchronize. Where there is a single memory space (HIP), we have to do a synchronize across the whole device to make sure the data is valid during host accesses.

Describe the solution you'd like

Making CHAI stream aware would be relatively straightforward if the camp resource used by RAJA was passed as an argument to the plugin functions. Additionally, the postLaunch function should also receive an event with a wait method that CHAI can call when it needs to be sure the kernel has been completed.

Describe alternatives you've considered

Instead of modifying the plugin, RAJA could set some global state that is accessible when the plugin methods are called.

Additional context

Umpire is working on camp resource aware allocators (LLNL/Umpire#901), which CHAI will also be using.

Also, note that even if only one stream is being used in an application, this new approach will be more efficient than synchronizing across the whole device.

@adayton1
Copy link
Member Author

Actually, passing an event to postLaunch might need some more design work. For preCapture, we set some global state that is then used by chai::ManagedArray's copy constructor. The global state is then reset in postCapture. I'm imagining analogous preDestroy and postDestroy functions that set and reset global state and chai::ManagedArray's destructor would operate off that global state. That could get tricky to do, though.

I'm also trying to re-evaluate if I can do some kind of registry pattern where chai::ManagedArrays register themselves or some callbacks and then the postLaunch could do the work without having to set some global state.

@adayton1
Copy link
Member Author

I've been operating on the assumption that an event can be waited on more than once. Is that true?

@MrBurmark
Copy link
Member

MrBurmark commented Sep 25, 2024

I think you can wait multiple times on an event. Note that we will also have to be careful using events as camp never frees/destroys the underlying cuda/hip event currently. I've thought about making events move-only or having a shared pointer so they can automatically clean up after themselves. I would prefer move-only as shared pointer can be expensive, but it would be a breaking change in the API. @long58

@rhornung67 rhornung67 added API/usability Enhancement reviewed Mark with this label when issue has been discussed by team labels Oct 15, 2024
@rhornung67
Copy link
Member

Target this for next RAJA Suite release.

@rhornung67
Copy link
Member

Talk to ALED and Spheral teams about move-only vs. reference counting for event tracking.

@adayton1
Copy link
Member Author

adayton1 commented Oct 22, 2024

We would need reference counting in events since multiple CHAI ManagedArrays could track the same event.

@adayton1
Copy link
Member Author

If you wanted to avoid passing an event on the postLaunch method, you could have RAJA::forall avoid returning an event and allow the application to create their own event if desired (since they passed in the resource to begin with, they can easily create one).

@MrBurmark
Copy link
Member

We return something this is convertible to an event from the forall, so if you don't use it then no event is created.

@adayton1
Copy link
Member Author

We return something this is convertible to an event from the forall, so if you don't use it then no event is created.

Nice! I guess we'll still have to be careful about both CHAI and the application creating separate events. Or is the overhead of creating an event pretty low?

@MrBurmark
Copy link
Member

That is still a concern, we'd have to measure. I don't think many apps use events.

@MrBurmark
Copy link
Member

Thinking about things like multiple ownership for events. I can imagine having events that contain a shared pointer to their contents or events that are move-only and those wanting multiple ownership semantics wrap them in a shared pointer.

@MrBurmark
Copy link
Member

MrBurmark commented Oct 29, 2024

Right now generic resources always come with the overhead of a shared pointer. I've been lamenting that for a while. If we wanted to avoid this we could perhaps have something like shared_resource and shared_event classes and unique_resource and unique_event classes? That way you could choose as a user which semantics you want. Though we'd still have to decide what semantics the typed resources and events should have. Currently typed resources and events are basically views to a stream or event so they can be copied and destroyed without affecting the underlying stream or event. The lifetime of the underlying resources, cuda/hip streams and events, are basically static storage duration and none of them are ever freed.

@adayton1
Copy link
Member Author

We'll definitely need events to be cleaned up - in a single run I could easily see tens of thousands of them being created. I'm less worried about streams at the moment, though that's still good to consider.

@adayton1
Copy link
Member Author

adayton1 commented Oct 29, 2024

Hmm, I'm not sure how to handle typed resources/events. I don't love the idea of a typed resource destroying a stream when it goes out of scope. Or for that matter, a typed event being waited on when the typed event goes out of scope. Though, if it didn't wait on the event, just released it (is that possible to do without waiting on it?), then that seems reasonable.

@MrBurmark
Copy link
Member

MrBurmark commented Oct 31, 2024

I don't think you have to wait on events to free them in cuda/hip.

@rhornung67 rhornung67 added this to the FY25 Development milestone Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API/usability Enhancement reviewed Mark with this label when issue has been discussed by team
Projects
None yet
Development

No branches or pull requests

4 participants