I think that BLAS, in this day and age, is becoming something of a misfeature...
Compilers have come so far that not only is it becoming rarer to see BLAS code outperform compiled code, but compiler developers would absolutely consider it a defect if handwritten code could meaningfully outperform compiled code on something as simple as DGEMM.
So while BLAS is certainly a quick way to sometimes get good performance on some problems for some problem sizes on some specific platforms, you might want to consider eventually deprecating BLAS.
You might find https://github.com/romeric/Fastor interesting - it has "meta-level" handwritten code (e.g. for AVX512) that, for me at least, outperforms both MKL and BLIS, significantly so for small problem sizes or as soon as there are meaningful opportunities for inter-procedural (caller/callee) optimization.
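For reference, here is a minimal sketch of the kind of plain-loop DGEMM the argument above refers to. It is written in Rust only because `std::simd` comes up in this thread; the function name `dgemm_naive`, the row-major layout, and the argument order are all assumptions made for illustration, not this project's API. The loop order (i, p, j) keeps the innermost loop running over contiguous memory, which is the shape compilers tend to auto-vectorize, though whether a given compiler actually does so, and how it compares to BLAS, depends on the target, flags, and problem size.

```rust
/// Hypothetical helper: C = alpha * A * B + beta * C.
/// All matrices are dense, row-major slices:
/// A is m x k, B is k x n, C is m x n.
pub fn dgemm_naive(
    m: usize,
    n: usize,
    k: usize,
    alpha: f64,
    a: &[f64],
    b: &[f64],
    beta: f64,
    c: &mut [f64],
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);

    for i in 0..m {
        // Row i of C, borrowed once so the inner loops have no bounds checks.
        let c_row = &mut c[i * n..(i + 1) * n];

        // Scale the existing contents of the row by beta.
        for x in c_row.iter_mut() {
            *x *= beta;
        }

        // Accumulate alpha * A[i][p] * B[p][..] into the row.
        for p in 0..k {
            let a_ip = alpha * a[i * k + p];
            let b_row = &b[p * n..(p + 1) * n];
            // Contiguous multiply-add over j: a candidate for auto-vectorization.
            for (c_ij, b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * b_pj;
            }
        }
    }
}
```

Calling it with `m = n = k = 2`, `alpha = 1.0`, `beta = 0.0`, `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]` fills `c` with the ordinary matrix product. Unlike reference BLAS, this sketch always reads `c`, even when `beta` is zero, so a NaN already in `c` would propagate.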
The long-term goal is to eventually abandon BLAS and use std::simd. As far as I know, it is still experimental and still actively being worked on. By using it, we can have platform-independent code that will hopefully perform as well as, or better than, BLAS, or even vendor-specific BLAS implementations like BLIS or Intel MKL.