Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subgroup Level Bluestein Algorithm #145

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

Subgroup Level Bluestein Algorithm #145

wants to merge 23 commits into from

Conversation

AD2605
Copy link
Contributor

@AD2605 AD2605 commented Mar 14, 2024

Adds the Bluestein Algorithm at the subgroup level an enables the following support in portFFT

  • 1D Prime sized DFTs which after padding fit in the subgroup level, and supports the following data layout:

    • Supports SPLIT_COMPLEX AND INTERLEAVE_COMPLEX storage scheme
    • Supports PACKED and BATCH_INTERLEAVED storage
  • 2D and 3D FFTs where each prime dimension after padding must fit in the subgroup level

    • Supports SPLIT_COMPLEX AND INTERLEAVE_COMPLEX storage scheme

Checklist

Tick if relevant:

  • New files have a copyright
  • New headers have an include guards
  • API is documented with Doxygen
  • New functionalities are tested
  • Tests pass locally
  • Files are clang-formatted

@Rbiessy
Copy link
Collaborator

Rbiessy commented Mar 14, 2024

Thanks for the PR, I won't have time to review it myself until next week

@Rbiessy Rbiessy requested a review from t4c1 March 14, 2024 09:34
Copy link
Contributor

@hjabird hjabird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to review if some of the changes to transfer functions were reverted.

.clang-tidy Outdated Show resolved Hide resolved
src/portfft/committed_descriptor_impl.hpp Outdated Show resolved Hide resolved
Idx used_sg_size;
Idx num_batches_in_l2;
Idx num_factors;
bool is_prime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Now we have length and committed_length, what does this (and eg. num_factors) refer to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'length' and 'committed length' fields are used calculation of twiddles and setting some spec-constants like fft size.
Later down the control flow when we need to calculate twiddles , I would've repeatedly need to compare length and committed_length to check if there is any padding or not, instead this is set in the constructor of dimension_struct itself, indicating that a prime algorithm is used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm lost - I don't think you've really answered my question of whether num_factors is a the number of factors of committed_lenght or length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I forgot to add the line about num_factors,
num_factors is the number of factors of length

src/portfft/committed_descriptor_impl.hpp Show resolved Hide resolved
src/portfft/committed_descriptor_impl.hpp Show resolved Hide resolved
test/unit_test/fft_test_utils.hpp Show resolved Hide resolved
src/portfft/utils.hpp Outdated Show resolved Hide resolved
src/portfft/defines.hpp Outdated Show resolved Hide resolved
* @param global_data global_data_struct associated with the kernel launch
*/
template <detail::level Group, detail::transfer_direction Direction, Idx SubgroupSize, typename LocView, typename T>
PORTFFT_INLINE void local_global_packed_copy(T* global_ptr, LocView& loc_view, IdxGlobal global_offset,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these new transfer functions are refactoring that - as far as I can tell - is not required for the Bluestein implementation. I'd prefer it if they could be reverted, so that we could discuss the course we want to take with them in a separate PR in the future.

Copy link
Contributor Author

@AD2605 AD2605 Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not have done Bluestein without the refactoring, as it otherwise would have been very difficult to debug. The number of lines in subgroup_impl has reduced significantly, whereas previously I found myself scrolling for some time and as it turned out I had only traversed a small fraction of the entire kernel !

Also, quite a lot of refactoring was necessary . For example in the sg_bluestein functions, when I need to load/store from/to local memory, and how sg_bluestein_packed_copy uses local_global_packed_copy.
Without the refactor, this would have resulted in a massive amount of duplicated code (repeated view creations)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we can keep perfecting the refactor by taking it further from here later on whenever we come back to it, but at least it leaves the code in a usable state. I have applied the same to workgroup level bluestein as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Refactoring should be in a separate PR. Nothing here looks essential for Bluestein.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this should be a separate PR, even if the code might look ugly.

  • On principal, 1 PR should only do 1 thing.
  • We used to have lots of transfer functions, do we want that again?
  • Separating out functions like this makes it harder to unify the implementation later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On principal, 1 PR should only do 1 thing.

...Ideally, assuming given sufficient time and all the resources (including the ease of implementation) available to implement the goal of the PR. This is also why I have always advocated for a refactor as you go approach before putting up the PR

We used to have lots of transfer functions, do we want that again?

If anything they make the code more verbose, and you can read the order our operations as if reading phrases of English, i,e, copy_global_to_local ---> local_to_private --> compute --> private_to_local ---> local_to_global. All the complexity involved with how many views to be created is shifted elsewhere. We can keep trickling this down to our other sections later on, and adjust if required.

Sorry for taking a firm stance here, but I do not suppose I will be reverting this change, given the time crunch and amount of things I would need to change, and no downside associated with it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just catching up on this comment, I think Hugh made good points here. We're also trying to push for splitting the PRs as much as possible and discussing early when a large change is needed. Given that we have already refactored transfer functions multiple times we would have needed long discussions which we don't have time for anymore.

I strongly suggest that we don't bother merging Bluestein in main but instead:

  • Keep all the Bluestein work in a bluestein branch.
  • Create and close the other 2 PRs for Bluestein that you had plan if they are ready already. The could be useful if we try to merge this in main again in the future.

src/portfft/dispatcher/subgroup_dispatcher.hpp Outdated Show resolved Hide resolved
.clang-tidy Outdated Show resolved Hide resolved
std::size_t length;
// The committed length for the particular dimension, will be different from length in the case of bluestein and
// radar fft algorithms
std::size_t committed_length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::size_t committed_length;
std::size_t padded_length;

The other lenght is also committed, making this variable name confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think committed_length is confusing. How about ct_length (standing for Cooley-Tukey)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committed_length is not the cooley_tukey length. Committed length is the user provided length (i.e the length of the dimension as in the descriptor)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case the other one could be ct_length. We just need something to make it obvious which is which.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be ct_length, we should not be tying it with the name of an algorithm, as it isn't always cooley tukey.

I believe length, along with its description above should be suffiicient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which other algorithm can it be?

src/portfft/committed_descriptor_impl.hpp Outdated Show resolved Hide resolved
src/portfft/committed_descriptor_impl.hpp Outdated Show resolved Hide resolved
src/portfft/committed_descriptor_impl.hpp Outdated Show resolved Hide resolved
src/portfft/common/subgroup.hpp Outdated Show resolved Hide resolved
Idx local_offset, Idx n_elements_to_copy,
detail::global_data_struct<1>& global_data) {
global_data.log_message(__func__, "storage scheme: INTERLEAVED_COMPLEX");
if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this function is doing is that if based on template parameter. Replace the calls to this functions by calls directly to global2local or local2global. Then you can delete this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just an extra step to call local2global / global2local, but the idea here was to abstract away view creations at the call site

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think views on the call site make code significantly easier to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

View creation at the call site adds a lot of extra lines. I have been able to shave off ~120 lines from subgroup_dispatcher function simply by shifting the view creation elsewhere. It had became very difficult for me go through the code and make changes at the required places

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that was at a cost of reducing code readability and adding double that number of lines to transpose. Not worth it in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The views are still being used, only the construction of views is now shifted to a separate function, and rightfully so

What transposition are you referring to ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that was a typo. I meant transfers.hpp.

IdxGlobal global_offset, Idx local_offset, Idx local_imag_offset,
Idx n_elements_to_copy, detail::global_data_struct<1>& global_data) {
global_data.log_message(__func__, "storage scheme: SPLIT_COMPLEX");
if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for the previous function. Actually the same can be done for the other new functions in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining this comment with the one above

* @param global_data global_data_struct associated with the kernel launch
*/
template <detail::level Group, detail::transfer_direction Direction, Idx SubgroupSize, typename LocView, typename T>
PORTFFT_INLINE void local_global_packed_copy(T* global_ptr, LocView& loc_view, IdxGlobal global_offset,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Refactoring should be in a separate PR. Nothing here looks essential for Bluestein.

Comment on lines +570 to +579
for (Idx i = 0; i < factor_sg; i++) {
for (Idx j = 0; j < factor_wi; j++) {
// Not using sycl::cospi / sycl::sinpi as std::cos/std::sin provides better accuracy in float and double tests
// Also why this was moved to host, this way the tolerance value needs to be bumped up by a smaller value
double theta = -2 * M_PI * static_cast<double>(i * j) / static_cast<double>(factor_wi * factor_sg);
auto twiddle = std::complex<Scalar>(static_cast<Scalar>(std::cos(theta)), static_cast<Scalar>(std::sin(theta)));
host_twiddles[static_cast<std::size_t>(j * factor_sg + i)] = twiddle.real();
host_twiddles[static_cast<std::size_t>((j + factor_wi) * factor_sg + i)] = twiddle.imag();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you moving calculation of twiddles to host?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better precision, I could lower the threshold value as result

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold which is calculated in the tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following - can you point me to the code?


// TODO: Currently local memory is being used to load the data back in natural order for the backward phase, as the
// result of sg_dft is transposed. However, the ideal way to this is using shuffles. Implement a batched matrix
// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we
// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. This we

// TODO: Currently local memory is being used to load the data back in natural order for the backward phase, as the
// result of sg_dft is transposed. However, the ideal way to this is using shuffles. Implement a batched matrix
// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we
// way can even avoid the 2 sg_bluestein functions that we have today
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which 2 functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg_bluestien packed and sg_bluestein batch_interleaved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does that change let us avoid them?

Idx local_offset, Idx n_elements_to_copy,
detail::global_data_struct<1>& global_data) {
global_data.log_message(__func__, "storage scheme: INTERLEAVED_COMPLEX");
if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that was at a cost of reducing code readability and adding double that number of lines to transpose. Not worth it in my opinion.

}
} // namespace portfft

#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline at the end of the file is missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants