-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup CAReduce C-implementation with loop reordering #971
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
bf9f6e6
to
054a5e4
Compare
ab1fbdd
to
756170b
Compare
79d5600
to
cfc8c77
Compare
cfc8c77
to
061ed3a
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #971 +/- ##
==========================================
+ Coverage 81.69% 81.70% +0.01%
==========================================
Files 182 182
Lines 47585 47590 +5
Branches 11584 11587 +3
==========================================
+ Hits 38875 38884 +9
Misses 6518 6518
+ Partials 2192 2188 -4
|
ricardoV94
commented
Aug 14, 2024
Comment on lines
+594
to
+598
for(int i = 0; i < n; i++) | ||
{ | ||
npy_float64 &acc_i = acc_iter[i]; | ||
acc_i = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works because we allow allocate a new contiguous array for the output
twiecki
approved these changes
Aug 15, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR adds runtime loop re-ordering for CAReduce operations in the C-backend, to alleviate the bad performance reported in #935
Using the new benchmark test, I confirm substantial speedup on summation along leading axis (30-110x) depending on C-contiguity or not. A full reduction on a non C-contiguous array is also 7.5x faster than before
On the down side, there are some slowdowns when summing the last axis, but those are smaller in magnitude (2-3.3x). This is probably harder for the compiler to optimize due to the many to one location when iterating over the array (not trivially parallelizable).
The old speedup had probably to do with defining a temporary allocation variable and only writing it to the output after consuming all the reduced axis at each step. I didn't want to add a special case for this (only works when the reduced axes are the ones with smallest strides).
If people are concerned I can investigate adding a special case. However I'm already happy with reducing the worst case scenarios, which were really lame.
Related Issue
Checklist
Type of change