-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to use thrust::reduce on any. #685
base: main
Are you sure you want to change the base?
Conversation
include/matx/operators/any.h
Outdated
inp_ptr + a_.TotalSize(), | ||
op.Init(), | ||
op); | ||
*result_ptr = result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm not really that faimliar with MatX. on line 79 I am doing a device_pointer_cast
- IIUC the Executor
is what determines if it's on host or device. I changed Executor ex
to Executor
to get past the warning as error that it wasn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZelboK, the Exec
functions take any generic operator and calls the transform function. In this case there's a separate transform overload for a CUDA executor and a host executor. Since the input can be any type of operator and not just a tensor, there may not be a Data()
method since it doesn't have to have memory backing it. For example, a user could do:
(a = any(ones<int>({4,4}))).run();
ones
has no Data
function since it has no memory backing it. So for this to work properly it would have to use thrust's iterator interface and wrap the operator in that as we do in other transforms. This is not a trivial change and can be a bit difficult if you're not familiar with the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliffburdick I've contributed a bit to CCCL actually so I think I should be fine in that regard. That makes sense to me so far, I'll have to try and actually implement it to see how it goes. I presume you are referring to make_reduce_iterator
for example. Curious to know why this isn't trivial though, is there something I'm missing?
Tangentially, is there a way to drastically reduce compile times? The feedback loop right now takes quite a long time.
I run with these options
cmake -DMATX_BUILD_TESTS=ON -DMATX_BUILD_BENCHMARKS=OFF -DMATX_BUILD_EXAMPLES=OFF -DMATX_BUILD_DOCS=OFF ..
and have tried commenting out tests but it still takes a long time before I actually get to see errors from the compiler. I have a pretty decent CPU (i9 12900k) too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZelboK, it might not be too hard then if you're familiar with CCCL and their iterators. We have our own iterator classes in iterator.h. Typically the way we use them is like this:
https://github.com/NVIDIA/MatX/blob/main/include/matx/transforms/cub.h#L720
We write a lambda to perform the function (CUB in this case), and ReduceInput
wraps it in iterators and collapses it. The tricky part might be that we have not done this with thrust though, so I don't know if our iterators are missing something to get it to work there.
To reduce compile times you shouldn't build everything each time. What I do is I take an existing example like fft_conv.cu (or make a new one), put my code in there that I'm testing, and compile just that with something like make fft_conv
. That should compile in just about 10 seconds on most machines, whereas compiling everything can approach an hour on weaker machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cliffburdick , I test out my code in an example now and the feedback loop is much better. Much appreciated.
__MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T Reduce(const T &v1, const T &v2) | ||
{ | ||
|
||
__MATX_HOST__ __MATX_DEVICE__ __MATX_INLINE__ T operator()(const T &v1, const T &v2) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be const for Thrust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZelboK , thanks for the PR! I left a comment that outlines the basic summary. This change may require significant work since it has to use iterators and we currently have no existing operator that uses thrust as an example.
…od. Refactor so that the operator_utils is capable of dispatching to thrust
include/matx/operators/any.h
Outdated
output_t out_base = output_; | ||
auto op = detail::reduceOpAny<value_t>(); | ||
|
||
auto rv = ReduceInputThrust(std::forward<OpA>(a_), std::forward<out_t>(output_), std::forward<decltype(op)>(op)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We construct the op
, and then send the inputs and outputs in their original form to the new function ReduceInputThrust
and of which gets collapsed and dispatched to thrust accordingly.
include/matx/core/operator_utils.h
Outdated
if (in_base.IsContiguous()) { | ||
// the conversion is already handled for us by RandomOperatorIterator | ||
thrust::reduce( | ||
iter + *begin, iter + *end, op.Init(), op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this function is responsible for dealing with the facct that In
may not necessarily be a tensor_t
but rather an operator of some sort. It'll get the respective offsets and construct an iterator for thrust to use, which, thankfully, does seem to be perfectly compatible.
Consequentially it would seem the code has also become potentially simpler than it's counterpart on main
. Thrust is now responsible for deciding whom exactly to use in CUB rather than MatX. I might be missing some context though on whether or not MatX needing to be responsible for what function to use in CUB/Thrust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In hindsight, I realize now that there might be issues with this approach. Does matXBinaryOp
for example need to be utilized? I see that it has methods like PreRun
PostRun
etc that make use of the Executor(which I have used here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ZelboK since a tensor is an operator, the iterator wrapper can turn any MatX operator type into an iterator. However, we do the contiguous check to allow CUB/thrust an optimization if it's a flat pointer with contiguous strides.
Regarding your second comment, the ReduceInput
function shouldn't/doesn't need to know whether it's CUB or thrust. That's something you pass in your lambda that you give to the function. I'm saying this all without actually trying it, but if you assume the iterator type is compatible between both libraries then there may not be any changes to that code. This was how the example I pointed to previously worked (just search ReduceInput in cub.h).
I'm not sure what your question about matxBinaryOp
is, but that's a wrapper class for any binary type and should be completely separate from this. matxBinaryOp
, like most of our types is an operator, and you can pass it to thrust/cub and have the iterator pull from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliffburdick I see. I should have clarified - I actually did try to follow that example you've shown me like this(but I did it incorrectly). If you're curious, here's what I did.
auto output_ = cuda::std::get<0>(out);
using out_t = decltype(output_);
using value_t = typename out_t::value_type;
using input_t = typename detail::base_type_t<OpA>; // incorrect, this is not any.h responsibility
using output_t = typename detail::base_type_t<out_t>; // incorrect
input_t in_base = a_;
output_t out_base = output_;
auto op = detail::reduceOpAny<value_t>();
auto fn = [&](input_t &&input,
output_t && output,
BeginOffset<input_t> &&begin,
EndOffset<input_t> &&end) {
return thrust::reduce(
input + *begin, input + *end, op.Init(), op
);
};
auto rv = ReduceInput(fn, out_base, in_base);
which brought me to my question of matXBinaryOp
. This led to errors like
has no member "type"
typedef typename thrust::iterator_system<InputIterator>::type System;
where InputIterator
was of type
matx::detail::matxBinaryOp<matx::detail::ConstVal<int, cuda::std::__4::array<matx::index_t, 2UL>>, matx::index_t, matx::detail::BinOp<int, matx::index_t, matx::detail::AddF<int, matx::index_t>>>
I should not be passing the base types though. Let me push the fix. Thanks for the help.
input + *begin, input + *end, op.Init(), op | ||
); | ||
}; | ||
auto rv = ReduceInput(fn, output_, a_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterators are thankfully compatible.
@cliffburdick Do you need anything done from my end? I imagine if so it would be relating to the executors? |
Hi @ZelboK is this ready to test? |
/build |
Yeah I recall running the tests which seemed to pass. Only thing is(even on |
Hi @ZelboK it's running, but we require all of our tests to pass. Unless it's an environment issue on your end we'll have to debug it. |
Hi @ZelboK , it looks like we're getting an exception in the tests:
Have you tried running them? |
Hi, Let me take a look when I have time later around Friday. Unfortunately I have less free time now than I'd like. I'll report back if I have troubles running the relevant tests. Curious to know if the test pipeline is only available through the NVIDIA VPN as an employee? I am not able to access the link. |
Hi @ZelboK , unfortunately no, there's no way you can view it, but I can look into that. You should be able to reproduce the exact same error by compiling the unit tests, then from your build directory run:
|
@cliffburdick Let me take a look now. By any chance are you ever on the nvidia developer discord btw? I know a lot of other teams like CCCL and cutlass participate. |
I didn't even know that existed, but I will try it out today or tomorrow. |
Saw the issue open for this and did this on a whim. I'm not too comfortable deleting code from this repository so for brevity I'll leave things as is. All this does is just use
thrust::reduce
instead of the defined kernelsSome clarifying points:
main
so I can't replicate a perfect test scenario.Executor
s I didn't spend too much time studying.