Performance bottleneck in Least Angle Regression #44
hisel implements the HSIC Lasso algorithm of Yamada, M. et al. (2012). The algorithm consists of two parts: the computation of Gram matrices and Least Angle Regression. I implemented the computation of the Gram matrices in a more vectorized way than other existing implementations of the same algorithm (such as pyHSICLasso), which yields some performance benefit. On top of that, I added GPU acceleration via CuPy. This gives a nice speedup - see the screenshot below.

[screenshot: timings of the Gram matrix computation, CPU vs GPU]
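To make the "vectorized" point concrete, here is a minimal sketch of a vectorized Gaussian-kernel Gram computation (illustrative only, not hisel's actual code; the `xp` backend parameter is the usual trick for swapping NumPy and CuPy):

```python
import numpy as np

def gaussian_gram(x, sigma=1.0, xp=np):
    # Pairwise squared distances in one vectorized expression:
    # d2[i, j] = ||x[i] - x[j]||^2, with no Python-level loops.
    sq = xp.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Passing a CuPy array and xp=cupy runs the same code on the GPU.
    return xp.exp(-d2 / (2.0 * sigma ** 2))
```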
GPU acceleration gives roughly a 3x speedup on the computation of the Gram matrices. However, when I assess the performance of the overall algorithm (Gram + LAR), this speedup seems to be lost, and I do not understand why.
These are the most expensive function calls in the CPU run:

[screenshot: profiler output of the CPU run]

And these are the most expensive function calls in the GPU run:

[screenshot: profiler output of the GPU run]
The computation of the Gram matrices is done via the call to apply_feature_map. You can see that the CPU run took 15.37 seconds, whereas the GPU run took 5.15 seconds. However, the overall algorithm (Gram + LAR) took 63 seconds on CPU and 100 seconds on GPU. The GPU run has to move tensors from the GPU back to the CPU, and this causes some overhead: you can see it in the 48 calls to CuPy's asnumpy. But those 48 calls collectively took 2.57 seconds, which does not explain the loss in performance. What does explain it is the call to lar.solve: it took 47.17 seconds after computing the Gram matrices on CPU, and 94.14 seconds after computing them on GPU. I do not understand this. The function should be doing exactly the same thing in both runs, irrespective of whether the Gram matrices passed to it were computed on CPU or on GPU (the move from GPU to CPU happens before the call to lar.solve).
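One thing worth ruling out (a hypothetical diagnostic on my part, not something already in the repo): the NumPy arrays that come back from cupy.asnumpy might differ from natively-computed ones in dtype or memory layout, which BLAS-backed routines can be sensitive to. Something like:

```python
import numpy as np
import cupy as cp

x_cpu = np.random.rand(1000, 1000)              # computed natively on CPU
x_gpu = cp.asnumpy(cp.random.rand(1000, 1000))  # computed on GPU, moved to CPU

# If the two arrays differ in dtype, contiguity, or alignment,
# that could make identical calls to lar.solve behave differently.
for name, a in [("cpu", x_cpu), ("gpu->cpu", x_gpu)]:
    print(name, a.dtype, a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"], a.flags["ALIGNED"])
```

If these attributes all match, the slowdown must come from somewhere else.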
Do you have an idea of why this happens? Can you help resolve the bottleneck in lar.solve?
The profiling reported here was obtained with the script profiling/select_profile.py.

I should also add that I have tried to reproduce the issue synthetically. In the script profiling/lar_comparison.py, I profile calls to the functions implementing Least Angle Regression in three different ways: a plain call to lar.solve passing NumPy arrays; a call to lar.solve passing NumPy arrays that have just been converted from CuPy arrays; and a call to the equivalent function in pyHSICLasso. A sketch of that comparison is below.
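This is roughly what profiling/lar_comparison.py does (a sketch: the shapes and the exact lar.solve signature are assumptions on my part, and the pyHSICLasso leg is omitted):

```python
import cProfile
import numpy as np
import cupy as cp
from hisel import lar  # module under test; solve's signature is assumed here

n = 2000
x_np = np.random.rand(n, n)              # plain NumPy input
x_cp = cp.asnumpy(cp.random.rand(n, n))  # NumPy input freshly converted from CuPy
y = np.random.rand(n)

# Profile the two calls that should cost the same but apparently do not.
for label, x in [("numpy", x_np), ("cupy->numpy", x_cp)]:
    profiler = cProfile.Profile()
    profiler.runcall(lar.solve, x, y)
    print(f"--- {label} ---")
    profiler.print_stats("cumtime")
```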