Implement argminmax #697
Conversation
…urrently does many bad things
…gs compile, but have not tested
@tylera-nvidia this will need a docs page too
// Set up CUB stuff
auto d_in = thrust::counting_iterator<size_t>(0);
auto min_op = minmaxer{in.Data()};
Passing the .Data() pointer will cause this to fail if the underlying tensor has been permuted. Ideally we'd rework RandomOperatorIterator and RandomOperatorOutputIterator to work correctly with Thrust.
At a minimum, suggest updating the docs (limitations.rst) to indicate this doesn't support permutations.
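A minimal sketch of the concern, using a hypothetical functor that approximates the PR's minmaxer (the name value_at and its field are illustrative): indexing through the raw pointer returned by .Data() follows memory order, so a permuted view would be read in the wrong logical order.

#include <cstddef>

// Hypothetical stand-in for the PR's minmaxer: it dereferences the raw base
// pointer from in.Data(), so index i maps to memory order rather than the
// logical order of a (possibly permuted) MatX view.
struct value_at {
  const float *base;   // result of in.Data(); stride information is lost
  __host__ __device__ float operator()(std::size_t i) const {
    return base[i];    // only correct for contiguous, row-major views
  }
};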
* CUDA stream
*/
template <typename OutType, typename TensorIndexType, typename InType>
void cub_reduce_custom(OutType minDest, TensorIndexType &minIdxs, OutType maxDest, TensorIndexType &maxIdxs, const InType &in, const cudaStream_t stream = 0)
Should this be 'cub_reduce_argminmax' instead of 'cub_reduce_custom' since it doesn't accept an arbitrary custom compare?
Or perhaps allow it to accept an arbitrary custom compare, but make this a 'cub_reduce_custom_4output' since it's outputting 4 values?
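For reference, a hypothetical sketch of the two directions (the names and the comparator parameter are illustrative, not code from this PR):

// Option 1 (hypothetical): rename to match the fixed argmin/argmax behavior.
template <typename OutType, typename TensorIndexType, typename InType>
void cub_reduce_argminmax(OutType minDest, TensorIndexType &minIdxs,
                          OutType maxDest, TensorIndexType &maxIdxs,
                          const InType &in, const cudaStream_t stream = 0);

// Option 2 (hypothetical): accept an arbitrary comparator and name the
// function after its four-output shape.
template <typename OutType, typename TensorIndexType, typename InType, typename CompareOp>
void cub_reduce_custom_4output(OutType lowDest, TensorIndexType &lowIdxs,
                               OutType highDest, TensorIndexType &highIdxs,
                               const InType &in, CompareOp cmp,
                               const cudaStream_t stream = 0);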
* is an existing reduce() implementation as part of matx_reduce.h, but this function
* exists in cases where CUB is more optimal/faster.
*
* @tparam OutputTensor
Doxygen comments don't match code (OutputTensor vs OutType, etc).
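For example, a corrected block matching the template parameters of cub_reduce_custom shown above might read (the parameter descriptions are assumptions):

/**
 * @tparam OutType          Output tensor type holding the min/max values
 * @tparam TensorIndexType  Output tensor type holding the argmin/argmax indices
 * @tparam InType           Input tensor or operator type
 * @param stream            CUDA stream
 */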
*
* @note Views being passed must be in row-major order
*
* @tparam OutputTensor
Doxygen comments don't match code
@tylera-nvidia can we close this one given #778?
@tylera-nvidia I saw you just committed another patch, but is this OBE given #778?
@cliffburdick I was doing some performance comparison with @tmartin-gh to make sure things were roughly equivalent. I ran into some performance issues when the return tensor is size {1} versus size {0}: runtime goes from ~40us to >1ms, but we still get the right answer. My implementation did not exhibit that behavior, but it looks like it was due to my "dumb" pointer arithmetic in writing out the output. After resolving that, performance seems roughly comparable, so I'm going to close this PR. It looks like caching is currently broken in main, and we have some unprotected conditions for weird outputs, but those should all be fixed with new branches off main, not based on my old implementation.
Implement a new argminmax function based on cub_reduce that calculates the following: the minimum value, the index of the minimum, the maximum value, and the index of the maximum of the input.
An example call would be:
(matx::mtie(minVal, minIdx, maxVal, maxIdx) = matx::argminmax(inFlattened)).run();
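A fuller usage sketch for context (the input shape, the fill step, and the rank-0 make_tensor syntax are assumptions; per the performance note above, the outputs are rank-0 rather than size-{1} tensors):

#include "matx.h"

int main() {
  // Hypothetical flattened 1-D input of 1024 elements.
  auto inFlattened = matx::make_tensor<float>({1024});
  // ... fill inFlattened with data ...

  // Rank-0 outputs (creation syntax assumed); size-{1} outputs were observed
  // above to be much slower than size-{0}.
  auto minVal = matx::make_tensor<float>({});
  auto maxVal = matx::make_tensor<float>({});
  auto minIdx = matx::make_tensor<matx::index_t>({});
  auto maxIdx = matx::make_tensor<matx::index_t>({});

  (matx::mtie(minVal, minIdx, maxVal, maxIdx) = matx::argminmax(inFlattened)).run();
  cudaStreamSynchronize(0);
  return 0;
}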