"Fast" solvers are slow for dense, complex matrix #159
How many Julia threads are open on your system?
LUFact and FastLUFact should have approximately the same time for the initial solve, I think? They're both BLAS calls.
Those are complex matrices. Or wait for the LoopVectorization rewrite to be finished.
Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?
8 threads on my system:

```julia
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
```
Summary:
- The RecursiveFactorization choice now checks for Float32/Float64 element types and Julia threads. This should fix the main issue in #159, that RFLUFactorization is not optimized for complex matrices.
- CuArrays now supports LU factorizations.
- If a standard matrix, then use FastLUFactorization.

That last bit still has a question mark (#159 (comment))
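The element-type check described in that summary can be sketched in plain Julia. This is illustrative only: the function name and the symbols it returns are made up for this sketch, not the actual LinearSolve.jl internals.

```julia
using LinearAlgebra

# Illustrative sketch of the heuristic described above, not LinearSolve.jl
# source: prefer a RecursiveFactorization-style pure-Julia LU only for real
# Float32/Float64 matrices with Julia threads available, and fall back to
# plain BLAS/LAPACK LU for everything else (e.g. ComplexF64).
function pick_default_lu(A::AbstractMatrix{T}) where {T}
    if T <: Union{Float32, Float64} && Threads.nthreads() > 1
        return :RFLUFactorization    # pure-Julia recursive LU path
    else
        return :LUFactorization      # BLAS/LAPACK getrf path
    end
end
```

With this dispatch, a `ComplexF64` matrix always takes the BLAS path, which is the behavior the fix in #160 is after.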
Yes, they should. I mean, FastLUFact is just an alternative interface to BLAS that skips rebuilding the workspace each time, so it's not that it should be as fast as BLAS, it is a BLAS call 😅 . So that's a pretty confusing result here that needs to get explored a bit more. I opened a PR that tweaks the default algorithms a bit more: #160 . Though I think we should figure out what's going on with FastLUFactorization.
LoopVectorization.jl currently only supports real element types. If we have a special need for it somewhere, I could take the time to hack support into LoopVectorization.jl; otherwise it's better to wait for the rewrite, where it will come for free.
I'm happy to wait for the rewrite. However, I'm also not seeing the advertised speedup on Float64:
https://github.com/SciML/LinearSolve.jl/blob/main/src/default.jl#L17
So the default algorithm was selected to be
I need to investigate what's happening in FastLUFactorization. It should take exactly as long as backslash until you do a re-solve, where it should be faster. Should be easy to find. I wonder if this is also true for Float64.
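The re-solve distinction can be seen with just the standard library: `\` refactorizes on every call, while a cached `lu` factorization amortizes the O(n³) work across solves. This is a stdlib sketch of the caching that FastLUFactorization is meant to exploit, not LinearSolve.jl code.

```julia
using LinearAlgebra

# `A \ b` computes an LU factorization internally on every call (O(n^3)),
# while reusing a cached factorization only does triangular solves (O(n^2)).
n = 300
A = rand(n, n)
b = rand(n)

F = lu(A)        # factor once up front
x1 = A \ b       # refactorizes internally
x2 = F \ b       # reuses the cached factorization: this is the cheap re-solve
@assert x1 ≈ x2
```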
Can you run the benchmarks? https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/master/perf/lu.jl And can you share your CPU? It was at least about 2x faster for the things we tested: JuliaLinearAlgebra/RecursiveFactorization.jl#28
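A quick stdlib-only stand-in for that perf script (noisier than BenchmarkTools, and timing plain LAPACK `lu` rather than RecursiveFactorization itself) looks like this; `time_lu` is an illustrative helper, not part of either package:

```julia
using LinearAlgebra

# Crude timing loop: factor an n×n matrix `reps` times and keep the minimum.
# Uses only Base `@elapsed`, so expect more noise than BenchmarkTools.
function time_lu(n::Integer; reps::Integer = 5)
    A = rand(n, n)
    lu(A)  # warm up / compile before timing
    minimum(@elapsed(lu(A)) for _ in 1:reps)
end

for n in (100, 200, 400)
    println("n = $n: ", time_lu(n), " s")
end
```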
We should make a version of that which just does all of the LinearSolve.jl options. I updated a few of the defaults in #160. I'll keep the cutoff at 500 for now since RF is never "bad", though I did change it to move away from RF if the user doesn't have Julia threads enabled. OpenBLAS performs badly on some CPUs, so that is a bit safer until sizes get large. But FastLUFactorization will stay out of the defaults until we figure out what's going on there (it's just using BLAS, so how is it slower 😅).
Thanks for your fast response on this issue. Looking forward to using this for complex matrices (ubiquitous in my work) in the future.
How many threads did you use in your benchmark?
Julia was started with 8 threads. BLAS threading was set by the script:

```julia
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
```

which looks like it would be 8 as well.
Yup so that fits the heuristic of the default algorithm:
If the size is under 100, or under 500 when OpenBLAS is the loaded BLAS, use RecursiveFactorization; otherwise use the (provided) BLAS. So I think that's pretty close to what you measure. I wonder if we could provide MKLLUFactorization, which doesn't overload LBT?
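That heuristic, sketched in plain Julia (the cutoffs follow the comment above; `is_openblas` and `lu_choice` are illustrative names, not LinearSolve.jl internals):

```julia
using LinearAlgebra

# Rough check for whether the loaded BLAS is OpenBLAS, via libblastrampoline's
# config string (available on Julia 1.7+).
is_openblas() = occursin("openblas", lowercase(string(BLAS.get_config())))

# Size-based choice matching the heuristic above: RecursiveFactorization below
# 100, or below 500 when OpenBLAS is loaded; otherwise the vendor BLAS.
function lu_choice(n::Integer)
    if n < 100 || (is_openblas() && n < 500)
        return :RecursiveFactorization
    else
        return :BLAS
    end
end
```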
Maybe we could even raise the 100; MKL only beat it at 160. I'd also be interested in seeing single-threaded plots of both, which would be relevant for any code trying to do solves in parallel, and thus single-threading the
We could, by depending on
I'm almost done with some similar runs on Windows. Want me to put the plots in a comment in #166?
Yes please!
FastLUFactorization in v2 is optimized to be non-allocating and all, see #313 . But it doesn't seem to make a real difference since that workspace is not the main cost, and the
According to the documentation, `RFLUFactorization` is supposed to be faster than backslash for matrices on the order of a few hundred. I did some quick benchmarking and I'm not finding this to be the case. Am I doing something wrong? I'm running Julia v1.7.3 and LinearSolve v1.22.1.

Also, I found that `FastLUFactorization` is not exported, so I had to invoke it as `LinearSolve.FastLUFactorization()`.