Speeding up the computation of sliding dot product with fft #938
@NimaSarajpoor I just noticed that
@NimaSarajpoor I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and is released under the more permissive MIT license. However, I tried to implement the basic
Would you mind taking a look? Maybe I messed up somewhere but I've been staring at it for too long and I'm not able to spot anything. Thanks in advance!
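For reference (the ported code itself didn't survive in this thread), a minimal NumPy port of OTFFT's radix-2 Stockham kernel might look roughly like the sketch below. The name `stockham_fft` and the buffer handling are illustrative assumptions, not the exact code under discussion:

```python
import numpy as np

def stockham_fft(x):
    # Radix-2 Stockham autosort FFT (out-of-place): the reordering is
    # absorbed into the passes, so no explicit bit-reversal is needed.
    # Assumes len(x) is a power of two.
    n = x.shape[0]
    a = x.astype(np.complex128)
    b = np.empty_like(a)
    half, stride = n // 2, 1
    while half >= 1:
        w = np.exp(-2j * np.pi * np.arange(half) / (2 * half))
        for p in range(half):
            for q in range(stride):
                c0 = a[q + stride * p]
                c1 = a[q + stride * (p + half)]
                b[q + stride * (2 * p)] = c0 + c1
                b[q + stride * (2 * p + 1)] = (c0 - c1) * w[p]
        a, b = b, a  # ping-pong buffers between passes
        half //= 2
        stride *= 2
    return a
```

Comparing `stockham_fft(x)` against `np.fft.fft(x)` with `np.allclose` is a quick way to catch a misplaced twiddle factor.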
Cool!
Sure! Will take a look. Also, according to the scipy doc:
I will test again and share the result and code for our future reference.
Currently,
And, this is the code for tracking the running time for different window sizes:
As observed:
Thanks @NimaSarajpoor. In case it matters (and if you're not already doing this), it would make sense to test window sizes and/or time series lengths in powers of two.
[Update] Correction regarding the label of the x-axis in the bottom figure: it should read "the length of query".
It turns out that
And, to test it:
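Since the snippet itself was lost above, here is a hedged sketch of the FFT-based sliding dot product this issue is about; the function name and the use of scipy.fft.rfft/irfft are my assumptions, not the exact code being tested:

```python
import numpy as np
from scipy import fft

def sliding_dot_product_fft(Q, T):
    # Convolution theorem: multiply the spectra, then invert. Reversing
    # Q turns the circular convolution into sliding dot products.
    n, m = T.shape[0], Q.shape[0]
    QT = fft.irfft(fft.rfft(T, n) * fft.rfft(Q[::-1], n), n)
    return QT[m - 1 : n]  # the n - m + 1 full-overlap positions
```

Note that scipy's inverse FFT already divides by n internally, which may be relevant to the division question below.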
Hmmm, I wonder why they performed the division?! Thanks for figuring it out. I just ported it over blindly without trying to understand it 🤣. How about the
This doesn't seem to match
Would you mind doing a performance comparison if you are able to crack this?
This should work:
I am working on some minor changes to boost the performance. I will share the performance of both the original version and the enhanced version, and will compare them with the
For now, I did some enhancements on the new fft / ifft functions suggested in #938 (comment).
Part (I): These are the descriptions of the four versions:
For a time series with length
And, we can zoom in further by removing the
Part (II): As observed, the gap in performance becomes bigger as the length of the time series increases. The code is available in the notebook pushed to PR #939.
Next steps:
@seanlaw
If I understand correctly, the Stockham algorithm is NOT faster than
I wonder if there might be some clues in the OTFFT source code in terms of how they might have parallelized it using OpenMP. I'd be pretty happy if we could get within 2x (slower) of FFTW. I'm also guessing that the sawtooth shape observed in
Yes. After some initial optimizations, I can see that the Stockham algorithm (from OTFFT) is slower. So, as you mentioned:
I will try to go through it to get some ideas. I need to run the tests on the MATLAB online server if we want to consider MATLAB FFTW as our benchmark.
I think it gave us some boost. See the figure below...
It appears that maybe we should consider implementing the six-step or eight-step FFT algorithm next as it should have much better memory locality and is therefore "optimized". I'd expect this to be faster than our current sliding dot product. I'm not sure how/if any of the radix variants (shown at the same link above) will help.
I have implemented the six-step FFT. I will work on the eight-step FFT algorithm. [Note to self]
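For reference, the six-step decomposition for n = n1 * n2 can be sketched in plain NumPy as below. This is a simplified sketch of the algorithm's structure (relying on np.fft for the short transforms), not the njit-ready implementation in the PR:

```python
import numpy as np

def six_step_fft(x):
    # Bailey's six-step FFT: view x as an (n1 x n2) matrix, so one long
    # DFT becomes many short, cache-friendly DFTs plus transposes.
    n = x.shape[0]
    n1 = 1 << (int(np.log2(n)) // 2)  # assumes n is a power of two
    n2 = n // n1
    a = x.reshape(n1, n2).T.copy()                   # step 1: transpose
    a = np.fft.fft(a, axis=1)                        # step 2: n2 FFTs of size n1
    k1, k2 = np.arange(n1), np.arange(n2)
    a *= np.exp(-2j * np.pi * np.outer(k2, k1) / n)  # step 3: twiddle factors
    a = a.T.copy()                                   # step 4: transpose
    a = np.fft.fft(a, axis=1)                        # step 5: n1 FFTs of size n2
    return a.T.reshape(-1)                           # step 6: transpose, flatten
```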
In case it matters, section 6.3 might be relevant as it discusses Stockham and the 6-step algo. More importantly, it describes how/why cache memory is important.
According to this Mathworks webpage, MATLAB is equipped with built-in multithreading.
A few notes:
(2) We can implement the 6-step / 8-step FFT, parallelize it, and then compare it with the MATLAB code. We use the MATLAB online server.
(3) Should we expect the
Based on these performance results, I expect the
[Update]
I will go through section 6.3 and check the algo again to better understand it, and see if I am doing something wrong in my code. I may start with a notebook, and implement each function in one cell to go through them together with you, @seanlaw. Also, I need to take a look at the OTFFT source code.
This is 8-step with
It's surprising that it jumps so dramatically from 2^24 to 2^27. I wonder if FFTW suffers from this too or if it still scales linearly. Technically, this algo should be
I do not know what is happening there. I wanted to try
As suggested by @seanlaw in #939 (comment), we are going to take advantage of
In the following figure,
I will work on finding opportunities to improve the performance. According to #938 (comment), we might be able to just use
It seems that MATLAB FFTW shows similar behaviour! (see below)
Also, the following figures confirm that the six-step / eight-step FFT outperforms FFTW in MATLAB.
Let's zoom in for MATLAB:
Python:
The code is available in this gist:
https://gist.github.com/NimaSarajpoor/5cf2d4f0b0aad5e0898651dddff35c17
@seanlaw Maybe we should just ignore the recent pushes to the new notebook, and I start over by adding one function at a time.
Amazing! Can you confirm whether MATLAB and 6/8-step Python are single threaded or multithreaded?
Please give me some time to review and respond.
Thanks @NimaSarajpoor. Can you please provide the raw timing data in a table? I can see that we are doing better at longer time series lengths and with 8 threads. This is great. However, I think the majority of use cases will likely be in the
So, if MATLAB uses FFTW, then what might be happening is that MATLAB is creating FFTW "plans" in the background that allow it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html
I see! Yeah... that might be the reason. So, I am now thinking of looking at the MATLAB-vs-Python comparison from two perspectives:
(1) I can use
(2) I can just consider the running time of the first run. My understanding is that a plan for
[WIP] In the meantime, let's review the OTFFT 6-step FFT algorithm (which is for time series
6-step FFT
Note:
Reminder:
Regarding the transpose step: In our implementation, we do:
instead of
So, our for-loop implementation should be okay. I am stopping here regarding
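The two snippets above were lost in extraction, so the following is only a generic illustration of the kind of loop-order distinction being discussed: for a row-major array, the loop order decides whether the reads or the writes get the stride-1 (cache-friendly) access pattern.

```python
def transpose_read_contiguous(a, out):
    # Iterates over `a` row by row: stride-1 reads, strided writes.
    m, n = a.shape
    for i in range(m):
        for j in range(n):
            out[j, i] = a[i, j]

def transpose_write_contiguous(a, out):
    # Iterates over `out` row by row: strided reads, stride-1 writes.
    m, n = a.shape
    for j in range(n):
        for i in range(m):
            out[j, i] = a[i, j]
```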
Profiling 6-step FFT
Now that we have a (somewhat) clear definition of the six-step FFT (see the previous comment), we can wrap an njit-decorated function around each step and start profiling the six-step FFT algo at the "step" level.
As provided above, our
Note that the log2 of the length of the input in
Each row of
Note 1: I did not plot the running time for inputs of length 2^p, p < 10, as the calculated running times for almost all steps were zero.
Note that the transpose part (steps 1 & 6 together) takes around
Comparing the performances of different functions for the "Transpose" step
Note: Numba does not support the second argument. I used the following code to get the performance of each function for inputs of different lengths:
We use the performance of
Performance ratio (in macOS)
A value below the horizontal line
Note: It is interesting to see that the function func5 (and func6) performs poorly for len(T) between 2^16 and 2^20 (inclusive). It is interesting to note that we saw a similar bump in the performance of FFT before. Can this be the reason? Also, why do we have a bump in the performance here?
Interestingly,
shows good performance overall. However, it is considerably outperformed by
I also ran the code on Linux (using Google Colab):
Performance of transpose functions (in Linux via Google Colab, with two threads)
This cache-oblivious transpose algorithm (and matrix multiplication algorithm) might be useful for future reference as it relates to blocking/tiling.
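For context, the linked cache-oblivious transpose boils down to a divide-and-conquer like the following sketch; the function name and the cutoff value are my own illustrative choices:

```python
def co_transpose(a, out, i0, i1, j0, j1, cutoff=32):
    # Recursively halve the larger dimension; once a sub-block is small
    # enough to sit in cache, transpose it with plain loops. No cache
    # size appears anywhere, hence "cache-oblivious".
    if i1 - i0 <= cutoff and j1 - j0 <= cutoff:
        for i in range(i0, i1):
            for j in range(j0, j1):
                out[j, i] = a[i, j]
    elif i1 - i0 >= j1 - j0:
        im = (i0 + i1) // 2
        co_transpose(a, out, i0, im, j0, j1, cutoff)
        co_transpose(a, out, im, i1, j0, j1, cutoff)
    else:
        jm = (j0 + j1) // 2
        co_transpose(a, out, i0, i1, j0, jm, cutoff)
        co_transpose(a, out, i0, i1, jm, j1, cutoff)
```

It would be called as `co_transpose(a, out, 0, a.shape[0], 0, a.shape[1])` with `out` preallocated to the transposed shape.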
(TL;DR) Tiling seems to be effective for matrix transpose! (Check out
Let's set our baseline first. Of the three following functions, we choose
Alternative, more performant functions:
In addition to the functions above, we are going to consider one more function based on the algorithm proposed in Cache-Oblivious Algorithms.
The code to check the performance is provided in this Colab notebook. The following result was obtained by running the code on my macOS machine (not Google Colab). I collected 500 samples of running time per case and used their median. Each data point in the plot below is the running-time ratio w.r.t. the running time of the baseline function. Running the code in Colab gives a (roughly) similar result (in Colab, I collected 100 samples per case).
I think the winner is
Previously, we noticed that a
@seanlaw
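For readers skimming this thread, a blocked (tiled) transpose of the kind benchmarked above looks roughly like the sketch below; the block size of 64 is an illustrative guess, not the tuned value:

```python
from numba import njit

@njit
def tiled_transpose(a, out, block=64):
    # Process one (block x block) tile at a time so that both the reads
    # from `a` and the writes to `out` stay within cache-sized regions.
    m, n = a.shape
    for i in range(0, m, block):
        for j in range(0, n, block):
            for ii in range(i, min(i + block, m)):
                for jj in range(j, min(j + block, n)):
                    out[jj, ii] = a[ii, jj]
```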
@NimaSarajpoor That's awesome! Great work. It's nice to see that tiling works so well to help reduce cache misses! Also, there's no prange/multithreading either, which is good. It's really great and rewarding to be able to learn and leverage published academic research like this.
I really wonder if we can rethink tiling for the
Should we really turn on fastmath? If so, we will need to remember to correct the final output if the original inputs are not finite, right? So a
Yes! This is the first time I have seen the benefit of considering the cache with my own eyes 😅 Thank you for the guidance! That was a very good lesson for me! I still need to study more about it, but I think that was a very good experience!!!
Right! We can revisit it.
Good question! I am going to check a few things:
(1) What will be the performance after removing fastmath? (Or after keeping only certain flags?)
Btw, STUMPY cares about the sliding dot product. So, the private
(1) Removing
(2) Let denote
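On the "certain flags" option: Numba's fastmath argument also accepts a set of individual LLVM fast-math flags instead of True, so the NaN/inf-sensitive ones (nnan, ninf) can be left out, which avoids the post-hoc correction for non-finite inputs raised above. A minimal sketch, with an illustrative function:

```python
from numba import njit

# fastmath=True enables all flags, including `nnan`/`ninf` (assume no
# NaN/inf). Passing only the flags below keeps NaN/inf semantics intact
# while still allowing reassociation, FMA contraction, etc.
@njit(fastmath={"nsz", "arcp", "contract", "afn", "reassoc"})
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total
```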
I understand. If we only do
In other words, the
I also forgot to ask if you've played around with
Makes sense! Got the point. I will address it after wrapping up the private function.
I recently thought about this as I know you usually prefer to use a config variable, if possible/needed, rather than having a constant value in the code. Regarding
[WIP] Notebooks for profiling different components of our FFT implementation
Also, I have been profiling EACH block of code to figure out what works best per block. For instance, regarding
Compress T
(To be continued)
Also, another thing I have been trying out is using precomputed factors instead of computing each factor in the FFT algorithm. (This is based on something I read somewhere before; it is one of the known ways to further improve an FFT implementation.) The values of these factors depend ONLY on the length of the input. For instance, let's take a look at the following block of code that is used in the last stage of
We can see a similar part in the 8-step FFT algorithm, which is being called when
We can save the factors in an array of length
Note: It is certainly cleaner if we can reach reasonable performance without this! (A generic illustration of the idea is sketched below.)
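As a hedged illustration of the precomputation idea (not the actual block from the notebook): in a final radix-2 combine, the per-iteration exp() calls can be hoisted into an array `w` that depends only on the input length.

```python
import numpy as np

def last_stage(a, w):
    # Final combine of a radix-2 DIT FFT: assumes the first half of `a`
    # holds the DFT of the even-indexed samples and the second half the
    # DFT of the odd-indexed ones. `w[k] = exp(-2j*pi*k/n)`, precomputed
    # once per input length n and reused across calls.
    half = a.shape[0] // 2
    out = np.empty_like(a)
    for k in range(half):
        t = w[k] * a[k + half]
        out[k] = a[k] + t
        out[k + half] = a[k] - t
    return out
```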
You might be able to use functools.cache (rather than
So, when you call either of these functions, Python will check whether the given
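A sketch of that idea, assuming the factors are keyed on the input length n (functools.lru_cache(maxsize=None) behaves like functools.cache and also exists on older Pythons):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def twiddles(n):
    # `n` is hashable, so it serves as the cache key: the exp() below
    # runs only on the first call for a given length; later calls for
    # the same `n` return the stored array immediately.
    return np.exp(-2j * np.pi * np.arange(n // 2) / n)
```

One caveat: the cached object is a mutable NumPy array, so callers must treat it as read-only, or the cache will silently hand out a corrupted array on later calls.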
@NimaSarajpoor I discovered something really interesting while looking into cache-oblivious FFT algorithms. The inventor of cache-oblivious algorithms, Matteo Frigo, had provided a cache-oblivious version of FFT in his original paper (see Section 3) and it turns out that Frigo is also the creator of FFTW! Which, you guessed it, uses a cache-oblivious algorithm! I am starting to think that (in a separate PR) it might be helpful to create a notebook to help explain cache-oblivious algorithms (with diagrams) starting with the basics of how cache sits in between the CPU and main memory (RAM), to understanding blocks, a simple for-loop of how to divide up a matrix and traverse blocks, and how it can be used in a matrix
Precomputing the array has not shown promising results. I am going with the
Interesting! I skimmed the paper before. And our current cache-oblivious matrix transpose basically follows the approach proposed in the paper. I didn't notice, though, that he is one of the authors of FFTW! It might be too soon to say this (?) but it seems we are on the right track!
Agreed! I think that will be fun!
[Update]
Apparently the decorators
To see the impact of cache on performance
And the problem appears here:
In fact,
Hmm, this seems surprising. Can you share a small, self-contained, reproducible example with only the sine/cosine parts?
Surprising indeed! Will do! I also need to share one to show the impact of different block sizes on performance. I am trying to see if I can replace the recursive FFT function with a for-loop. Can that affect the performance? I read that it might be faster, as a recursive function pushes each call onto the stack. I will continue working on it for the next few days and provide an update.
I created three versions and put them all in this notebook. The first version is the original version. The second version takes advantage of the identity equation
[Update]
I checked the impact of the block size in the cache-oblivious matrix-transpose algorithm. You can see this notebook. I do not see any (significant) improvement from changing the block size.
I wrote the for-loop version of the recursive function we use in the 6-step FFT (see this notebook). I ran it on macOS and noticed no performance gain.
Here are a few observations for you to consider:
You are right (re:
isn't even called by
I initially wrote this function to be used in
I think it is worth it to use
The private function is actually a recursive function. I can convert it to a for-loop, and then keep everything in one function. I will work on it.
Two notes:
(2) In DAMP, for example, with each new batch of data points, we need to use FFT. Suppose that the window size
Right! Will test its performance as it is definitely cleaner!
To properly show the impact of using the precomputed array, I can write a function to call
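A hypothetical harness for that comparison might look like the following; the warm-up call matters so that neither JIT compilation nor the one-time factor precomputation is counted in the timings (all names here are illustrative):

```python
import time

import numpy as np

def median_runtime(fft_func, n, n_iter=100):
    # Warm up once (triggers njit compilation and fills any factor
    # cache), then report the median of the remaining timings.
    x = np.random.rand(n).astype(np.complex128)
    fft_func(x)
    timings = np.empty(n_iter)
    for i in range(n_iter):
        tic = time.perf_counter()
        fft_func(x)
        timings[i] = time.perf_counter() - tic
    return np.median(timings)
```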
[WIP]
The Colab notebook is now modified to address the point above. I also ran the code on my macOS machine and got the following result (code):
In BOTH functions
Now let's consider the full steps 2 & 5 (of our 6-step algorithm) rather than just their cos/sin part (code). I ran it on my macOS machine and got the following result:
Note that
Btw, is
Right! 😅 Thanks for catching it! Will fix it.
[Update]
Here, I am providing some important links to external resources and to comments mentioned in this thread:
range(17, 21)
Currently, in STUMPY, the sliding dot product (of a query Q and a time series T) is computed via one of the two following functions:
- core.sliding_dot_product, which takes advantage of the FFT trick using scipy.signal.convolve
- core._sliding_dot_product, which uses njit on top of np.dot
The sliding dot product in MATLAB (via the FFT trick) seems to be faster, though.
Can we get closer to the performance of MATLAB?
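For concreteness, the two approaches can be sketched as follows; this is a simplified sketch, not the exact STUMPY source:

```python
import numpy as np
from numba import njit
from scipy import signal

def sliding_dot_product(Q, T):
    # FFT-trick version: valid-mode convolution of T with the reversed
    # query equals the sliding dot product at every window position.
    return signal.convolve(np.flipud(Q), T, mode="valid")

@njit
def _sliding_dot_product(Q, T):
    # Direct njit version: one explicit dot product per window.
    m, n = Q.shape[0], T.shape[0]
    out = np.empty(n - m + 1)
    for i in range(n - m + 1):
        out[i] = np.dot(Q, T[i : i + m])
    return out
```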