Porting some optimization cases to run on GPU without UVM #1086
base: master
Conversation
…ds for separableScatterScalarResponse
…ScatterScalarResponse
…OFManager to avoid excess transfers
LGTM, thanks!
Looks good! But I have a few doubts and questions.
      const auto elem_LID = elem_lids(cell);
      const auto p_dof_lids = Kokkos::subview(p_elem_dof_lids,elem_LID,ALL);
      for (int node=0; node<num_deriv; ++node) {
        const LO lid = p_dof_lids(node);

        // Initialize Fad type for parameter value
-       const auto p_val = lid>=0 ? p_data[lid] : 0;
+       const auto p_val = lid>=0 ? p_data(lid) : 0;
        ParamScalarT v(num_deriv, node, p_val);
Does this work on device, and is it performant? With certain fads, doesn't it allocate memory on device at every call?
@@ -330,6 +350,18 @@ evaluateFields(typename Traits::EvalData workset)
   const int neq = sol_dof_mgr->getNumFields();
   const int num_deriv = this->numNodes;
   const bool trans = workset.transpose_dist_param_deriv;

+  if (Vp != Teuchos::null) {
Indentation seems off here.
-  const int cell = sideSet.ws_elem_idx.h_view(sideSet_idx);
+  const int cell = sideSet.ws_elem_idx.d_view(sideSet_idx);

   ScalarT diff_1[8] = {0};
Another case of an automatic tmp fad inside a kernel: are these ok? For SFad and SLFad, I think the answer is yes, but for generic Fad, I'm not sure. Someone with more Fad knowledge than me may know.
That's a good point. With DFad it will create temporaries, and it could run out of memory.
It's an issue with DFad on GPU, but we don't use DFad.
We do allow DFad during config though. If we don't want to allow DFad, we should make it clear and throw during config time.
I'm not concerned about running out of memory. I'm more concerned with doing lots of small Cuda allocations at run time.
Yeah, that makes sense. We should just set default SLFad values (that work with testing) when running with CUDA/HIP/SYCL, instead of having those lines in every Albany config, and error out if DFad is explicitly requested.
When using SFad/SLFad, these should be static allocations in local memory/registers. Worst case, if the memory spills, it should be about as costly as a read/write to global memory, but not as bad as a device malloc, which is what DFad attempts (unless we set up the memory pool).
My actual concern would be: how large is the derivative dimension of this scalar? I forgot we're talking about optimization, which could have a lot of derivative components... in which case we may run out of memory...
-   this->global_response_eval(0) += sum*scaling;
    }
    this->local_response_eval(cell,0) = sum*scaling;
+   KU::atomic_add<ExecutionSpace>(&(this->global_response_eval(0)), sum*scaling);
Wouldn't it be better to use parallel_reduce, rather than a parallel_for with atomic access? All threads access the same value, so contention is very high here...
Yeah, this should definitely be a reduce. I ported this early on when I was just trying to get something to work. I'll fix it.
@bartgol Is there a trick to doing a parallel_reduce with Sacado FAD types? I created a very simple example program based on https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Custom-Reductions-Built-In-Reducers-with-Custom-Scalar-Types.html and I'm getting compiler errors:
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/sacado/example/my_custom_reduce_example.cpp(35): error: no instance of constructor "Kokkos::View<DataType, Properties...>::View [with DataType=ScalarT *, Properties=<Kokkos::CudaSpace, Kokkos::MemoryUnmanaged>]" matches the argument list
argument types are: (ScalarT *, int)
detected during:
instantiation of "SumScalarT<ScalarT, Space>::result_view_type SumScalarT<ScalarT, Space>::view() const [with ScalarT=ScalarT, Space=Kokkos::CudaSpace]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1525): here
instantiation of "void Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute_impl(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1542): here
instantiation of "std::enable_if_t<<expression>, void> Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>, Dummy=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1798): here
instantiation of "std::enable_if_t<<expression>, void> Kokkos::parallel_reduce(const size_t &, const FunctorType &, const ReturnType &) [with FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
(51): here
Uhm, I am not sure. Maybe @etphipp has some advice.
This PR does the following:
I'm going to do some performance profiling to get an idea of what impact these changes have, but the code is ready for review in the meantime.