
Implement argminmax #697

Closed
wants to merge 10 commits into from

Conversation

tylera-nvidia (Collaborator)

This PR implements a new argminmax function based on cub_reduce that calculates the following:

  • the index of the data's minimum
  • the index of the data's maximum
  • the data's maximum
  • the data's minimum

An example call would be:

(matx::mtie(minVal, minIdx, maxVal, maxIdx) = matx::argminmax(inFlattened)).run();
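For context, here is a fuller usage sketch assuming the API shown above (the tensor names, sizes, and fill values are illustrative, not from the patch):

#include "matx.h"
#include <cstdio>

int main() {
  // Input data; managed memory allows host-side initialization
  auto in = matx::make_tensor<float>({1024});
  for (matx::index_t i = 0; i < in.Size(0); i++) {
    in(i) = static_cast<float>((i * 37) % 101); // arbitrary test values
  }

  // Rank-0 ("scalar") output tensors for the four results
  auto minVal = matx::make_tensor<float>({});
  auto maxVal = matx::make_tensor<float>({});
  auto minIdx = matx::make_tensor<matx::index_t>({});
  auto maxIdx = matx::make_tensor<matx::index_t>({});

  (matx::mtie(minVal, minIdx, maxVal, maxIdx) = matx::argminmax(in)).run();
  cudaStreamSynchronize(0); // results are written asynchronously

  printf("min %f at index %lld, max %f at index %lld\n",
         minVal(), (long long)minIdx(), maxVal(), (long long)maxIdx());
  return 0;
}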

@tylera-nvidia added the enhancement (New feature or request) label on Aug 6, 2024
@tylera-nvidia self-assigned this on Aug 6, 2024
@tylera-nvidia linked an issue on Aug 6, 2024 that may be closed by this pull request
@cliffburdick (Collaborator)

@tylera-nvidia this will need a docs page too


// Set up CUB stuff
auto d_in = thrust::counting_iterator<size_t>(0);
auto min_op = minmaxer{in.Data()};
Collaborator:

Passing the .Data() pointer will cause this to fail if the underlying tensor has been permuted. Ideally we'd rework RandomOperatorIterator and RandomOperatorOutputIterator to work correctly with Thrust.

At a minimum, suggest updating the docs (limitations.rst) to indicate this doesn't support permutations.
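
A minimal sketch of the failure mode described above (variable names are illustrative, not from the patch):

// The raw pointer carries no permutation metadata, so dereferencing it
// through a counting iterator walks the unpermuted row-major layout.
auto a  = matx::make_tensor<float>({2, 3});
auto ap = a.Permute({1, 0}); // 3x2 logical view over the same storage

// ap.Data() == a.Data(): element i of the permuted view is generally NOT
// at ap.Data()[i], so an argmin/argmax index computed from
// counting_iterator + Data() refers to the original layout, not the view.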

* CUDA stream
*/
template <typename OutType, typename TensorIndexType, typename InType>
void cub_reduce_custom(OutType minDest, TensorIndexType &minIdxs, OutType maxDest, TensorIndexType &maxIdxs, const InType &in, const cudaStream_t stream = 0)
Collaborator:

Should this be 'cub_reduce_argminmax' instead of 'cub_reduce_custom', since it doesn't accept an arbitrary custom compare?

Collaborator:

Or perhaps allow it to accept an arbitrary custom compare, but make this a 'cub_reduce_custom_4output' since it's outputting 4 values?
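
A possible shape for that generalization (purely hypothetical, not part of this patch):

// Same four outputs, but the comparator driving the reduction is
// supplied by the caller instead of being hard-coded to min/max.
template <typename OutType, typename TensorIndexType, typename InType,
          typename CompareOp>
void cub_reduce_custom_4output(OutType minDest, TensorIndexType &minIdxs,
                               OutType maxDest, TensorIndexType &maxIdxs,
                               const InType &in, CompareOp cmp,
                               const cudaStream_t stream = 0);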

* is an existing reduce() implementation as part of matx_reduce.h, but this function
* exists in cases where CUB is more optimal/faster.
*
* @tparam OutputTensor
Collaborator:

Doxygen comments don't match code (OutputTensor vs OutType, etc).
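
For illustration, a doxygen block consistent with the cub_reduce_custom signature above might read (parameter descriptions are inferred, not from the patch):

/**
 * @tparam OutType         Output tensor type holding the min/max values
 * @tparam TensorIndexType Output tensor type holding the min/max indices
 * @tparam InType          Input operator/tensor type
 * @param minDest Tensor receiving the minimum value
 * @param minIdxs Tensor receiving the index of the minimum
 * @param maxDest Tensor receiving the maximum value
 * @param maxIdxs Tensor receiving the index of the maximum
 * @param in      Input operator
 * @param stream  CUDA stream
 */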

*
* @note Views being passed must be in row-major order
*
* @tparam OutputTensor
Collaborator:

Doxygen comments don't match code.

@cliffburdick (Collaborator)

@tylera-nvidia can we close this one given #778?

@cliffburdick (Collaborator)

@tylera-nvidia I saw you just committed another patch, but is this OBE (overtaken by events) given #778?

@tylera-nvidia (Collaborator, Author)

@cliffburdick I was doing some performance comparisons with @tmartin-gh to make sure things were roughly equivalent. I ran into some performance issues when the return tensor is size {1} versus size {0}: runtime goes from ~40us to >1ms, but we still get the right answer. My implementation did not exhibit that behavior, but it looks like it was due to my "dumb" pointer arithmetic in writing out the output.

After resolving that, performance seems roughly comparable, so I'm going to close this PR. It looks like caching is currently broken in main, and we have some unprotected conditions for weird outputs, but those should all be fixed with new branches off main, not based on my old implementation.
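
For anyone comparing against this later, the shape distinction in question (a sketch; the timings are from the discussion above, not re-measured):

// Both hold exactly one element, but per the measurements above they hit
// different code paths (~40us for size {0} vs >1ms for size {1}).
auto minVal0 = matx::make_tensor<float>({});   // size {0}: rank-0 "scalar" tensor
auto minVal1 = matx::make_tensor<float>({1});  // size {1}: rank-1 tensor of one element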

Labels: enhancement (New feature or request)
Linked issue: [FEA] add argminmax function
3 participants